HBase cluster with damaged region file on HDFS

Question

We have this HBase cluster: more than 30 nodes, 48 tables, 40+ TB at the HDFS level, replication factor 2. Due to disk failures on two nodes, we have a corrupted file on HDFS.

Current state of HDFS

Output of hdfs fsck / showing the corrupted HBase region file:

/user/hbase/table_foo_bar/295cff9c67379c1204a6ddd15808af0b/n/ae0fdf7d0fa24ad1914ca934d3493e56: 
 CORRUPT blockpool BP-323062689-192.168.12.45-1357244568924 block blk_9209554458788732793
/user/hbase/table_foo_bar/295cff9c67379c1204a6ddd15808af0b/n/ae0fdf7d0fa24ad1914ca934d3493e56:
 MISSING 1 blocks of total size 134217728 B

  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         134217728 B
  CORRUPT BLOCKS:       1

The filesystem under path '/' is CORRUPT
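For anyone who needs the affected paths in a script, here is a minimal sketch that pulls them out of fsck output. It assumes the "<path>: CORRUPT ..." / "<path>: MISSING ..." layout shown above (path and status on one line, or split over two lines as in this post) and paths without spaces; corrupt_paths is a made-up helper name, not part of Hadoop.

```shell
#!/bin/sh
# corrupt_paths (hypothetical helper): read `hdfs fsck` output on stdin and
# print each path flagged CORRUPT or MISSING exactly once.
# Assumption: paths contain no spaces.
corrupt_paths() {
  awk '
    # Path alone on a line, ending in ":" -> remember it for the next line.
    /^\/.*:[ \t]*$/ { path = $0; sub(/:[ \t]*$/, "", path); next }
    # Status on the following line -> report the remembered path.
    /^[ \t]*(CORRUPT|MISSING) / && path != "" { print path; path = ""; next }
    # Path and status on the same line.
    /^\/[^ ]*: *(CORRUPT|MISSING) / { p = $1; sub(/:$/, "", p); print p }
  ' | sort -u
}

# Usage (needs a running cluster):
#   hdfs fsck / | corrupt_paths
```

fsck also has a -list-corruptfileblocks option that prints the affected files directly, if your Hadoop version supports it.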

The lost data cannot be recovered (the disks are broken).

Current state of HBase

According to HBase, on the other hand, everything is fine and dandy. hbase hbck reports:

Version: 0.94.6-cdh4.4.0
...
 table_foo_bar is okay.
   Number of regions: 1425
   Deployed on:  ....
...
0 inconsistencies detected.
Status: OK

Also, it seems that we can still query data from the blocks of the corrupted region file that were not lost (as far as I was able to check, based on the start and end row keys of the region).

Next steps

Since the data of the lost block cannot be recovered, it seems the only option is to delete the complete corrupted file (using hadoop fs -rm or hadoop fsck -delete /). This will "fix" the corruption at the HDFS level.
However, I am afraid that deleting the HDFS file will introduce corruption at the HBase level, as a complete region file will disappear.
I have considered hadoop fsck -move / to move the corrupted file to /lost+found and see how HBase takes that, but moving to /lost+found does not seem to be reversible, so I am hesitant about that as well.
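One thing that might address the reversibility worry is a plain rename to a directory outside the HBase root instead of fsck -move, since a rename can be undone with another hdfs dfs -mv. This is an untested assumption, not something tried in this thread, and HBase may well react to it the same way as to a delete. The sketch below only prints the commands; quarantine_plan and the /quarantine default are hypothetical.

```shell
#!/bin/sh
# quarantine_plan (hypothetical helper): print, but do not run, the commands
# that would park a file under a quarantine directory while keeping its
# original path below it, so it can be moved back later.
# $1 = HDFS path of the file, $2 = quarantine root (assumed default: /quarantine)
quarantine_plan() {
  src=$1
  dest_root=${2:-/quarantine}
  printf 'hdfs dfs -mkdir -p %s\n' "$dest_root$(dirname "$src")"
  printf 'hdfs dfs -mv %s %s\n' "$src" "$dest_root$src"
}

# Usage: review the printed commands, then run them by hand as the hdfs user.
#   quarantine_plan /path/to/corrupted/file
```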

Concrete questions:

Should I just delete the file? (Losing the data corresponding to this region is acceptable for us.) What bad things happen when you manually delete an HBase region file in HDFS? Does it just remove data, or does it introduce ugly metadata corruption in HBase that also has to be taken care of?

Or can we actually leave the situation as it is, since it seems to be working for the moment (HBase is not complaining about, or even seeing, the corruption)?

hbase hadoop hdfs corruption fsck

Stefaan 23 june 15 at 18:53

3 answers

Dan M · Answer 1 · 2015-06-26T23:19:09+0000

We had a similar situation: 5 missing blocks and 5 corrupted files for an HBase table.
HBase version: 0.94.15
distro: CDH 4.7
OS: CentOS 6.4

Recovery instructions:

switch to the hbase user: su hbase
hbase hbck -details (to understand the scope of the problem)
hbase hbck -fix (to try to recover from region-level inconsistencies)
hbase hbck -repair (tried to auto-repair, but this actually increased the number of inconsistencies by 1)
hbase hbck -fixMeta -fixAssignments
hbase hbck -repair (this time the tables were repaired)
hbase hbck -details (to confirm the fix)
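If you want to check the result of each hbck run from a script rather than by eye, a small sketch like this can extract the count. It assumes the summary line has the form "N inconsistencies detected." as in the output quoted in the question; hbck_inconsistencies is a made-up helper name.

```shell
#!/bin/sh
# hbck_inconsistencies (hypothetical helper): read `hbase hbck` output on
# stdin and print N from the last "N inconsistencies detected." line.
# Prints nothing if no such line is present.
hbck_inconsistencies() {
  sed -n 's/^\([0-9][0-9]*\) inconsistencies detected\.$/\1/p' | tail -n 1
}

# Usage (as the hbase user, needs a running cluster):
#   n=$(hbase hbck -details 2>/dev/null | hbck_inconsistencies)
#   [ "$n" = "0" ] && echo "hbck reports no inconsistencies"
```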

At this point HBase was healthy: it had added an extra region and dropped its references to the corrupted files. However, HDFS still had 5 corrupted files. Since they were no longer referenced by HBase, we deleted them:

switch to the hdfs user: su hdfs
hdfs fsck / (to understand the scope of the problem)
hdfs fsck / -delete (deletes only the corrupted files)
hdfs fsck / (to confirm a healthy status)
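The final health check can be scripted as well. This is a minimal sketch that assumes fsck ends with a line of the form "The filesystem under path '/' is HEALTHY" (or CORRUPT), as in the output quoted in the question; fsck_status is a made-up helper name.

```shell
#!/bin/sh
# fsck_status (hypothetical helper): read `hdfs fsck` output on stdin and
# print the final verdict word (e.g. HEALTHY or CORRUPT).
# Prints nothing if the verdict line is missing.
fsck_status() {
  sed -n "s/^The filesystem under path '.*' is \([A-Z][A-Z]*\)\$/\1/p" | tail -n 1
}

# Usage (as the hdfs user, needs a running cluster):
#   [ "$(hdfs fsck / | fsck_status)" = "HEALTHY" ] || echo "HDFS still reports problems"
```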

NOTE: it is important to completely stop the stack in order to reset the caches
(stop all Thrift, HBase, ZooKeeper and HDFS services, then start them again in reverse order).

[1] Cloudera page for hbck command:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbck_poller.html

Stefaan · Answer 2 · 2015-06-24T21:48:17+0000

FYI: I decided to bite the bullet and manually removed the corrupted file from HDFS with:

hdfs dfs -rm /user/hbase/table_foo_bar/295cff9c67379c1204a6dd....

(hdfs fsck -move did not work for me, I don't know why.)

After that I checked the health of HBase with hbck, but no inconsistencies were found:

$ hbase hbck
...
0 inconsistencies detected.
Status: OK

So, in our case, manually deleting the region file did not corrupt HBase, if I understand correctly, which is nice but confusing. (I hope this does not backfire, with corruption showing up at a later point in time.)

issue closed

Your mileage may vary.

UserszrKs · Answer 3 · 2017-08-24T05:29:56+0000

If the inconsistencies are at the region level, use the -fix argument to direct hbck to try to fix them. The following sequence of steps is then performed:

$ ./bin/hbase hbck -fix

-fix includes:

The standard check for inconsistencies is executed.
If needed, repairs are made to tables.
If needed, repairs are made to regions. Regions are closed while a repair is in progress.

So, before running -fix, you may want to fix individual region-level inconsistencies separately, using:

-fixAssignments (equivalent to the 0.90 -fix option), which repairs unassigned, incorrectly assigned, or multiply assigned regions.

-fixMeta, which removes meta rows when the corresponding regions are not present in HDFS, and adds new meta rows if the regions are present in HDFS but not in META.

-fix includes {-fixAssignments and -fixMeta}

 $ ./bin/hbase hbck -fixAssignments
 $ ./bin/hbase hbck -fixAssignments -fixMeta

There are several classes of table integrity problems that are classified as low-risk repairs. The first two are degenerate regions (startkey == endkey) and backwards regions (startkey > endkey). These are handled automatically by sidelining the data to a temporary directory (/hbck/xxxx). The third low-risk class is HDFS region holes. These can be repaired using:

-fixHdfsHoles, which creates new empty regions on the file system. If holes are detected, you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new regions consistent.

 $ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles

-repairHoles includes {-fixAssignments -fixMeta -fixHdfsHoles}

 $ ./bin/hbase hbck -repairHoles
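The escalation described in this answer (assignments first, then meta, then HDFS holes) could be laid out as a dry run like the one below. It only prints the three commands quoted above so they can be reviewed and run one at a time; hbck_stages is a made-up helper name, and the ./bin/hbase default is taken from the examples in this answer.

```shell
#!/bin/sh
# hbck_stages (hypothetical helper): print, but do not run, the hbck repair
# commands from this answer, from least to most invasive.
# $1 = path to the hbase launcher (assumed default: ./bin/hbase)
hbck_stages() {
  hbase_bin=${1:-./bin/hbase}
  for flags in \
    "-fixAssignments" \
    "-fixAssignments -fixMeta" \
    "-fixAssignments -fixMeta -fixHdfsHoles"
  do
    printf '%s hbck %s\n' "$hbase_bin" "$flags"
  done
}

# Usage: print the plan, run one stage, re-check with `hbase hbck -details`,
# and only move to the next stage if inconsistencies remain.
#   hbck_stages
```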
