Why is the HBase count operation so slow?

Command:

count 'tableName'

It is very slow to get the total row count of the whole table.

My situation:

  • I have one master and two slaves; each node has 16 CPUs and 16 GB of memory.

  • My table has only one column family with two columns: title and content.

  • The title column is more than 100 bytes; the content can be 5 MB.

  • The table currently has 1550 rows, and every time I count the rows it takes about 2 minutes.

I am very curious why HBase is so slow at this operation; it seems even slower than MySQL. Is Cassandra faster than HBase at this kind of operation?


2 answers


First of all, you have a very small amount of data. At that volume, using NoSQL won't give you any advantage, IMO, and your test is not a fair way to evaluate HBase or Cassandra. Both have their own use cases and sweet spots.

The count command in the HBase shell runs a single client-side Java scan to count the rows. Even so, I am surprised that it takes 2 minutes to count 1550 rows. If you want the count to be faster (especially for a larger dataset), you should run the RowCounter MapReduce job that ships with HBase.
Start the MapReduce job as follows:



bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'tableName'
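If you just want the shell count to hurt less in the meantime, you can also raise the scanner caching so the client fetches rows in batches instead of one RPC per row. This is standard HBase shell syntax rather than something from the answer above, and with 5 MB rows you should keep the batch small:

# fetch 100 rows per RPC instead of one at a time
count 'tableName', CACHE => 100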



First of all, remember that in order to benefit from data locality your "slaves" (better known as RegionServers) must also have the DataNode role; that said, missing it is not necessarily the performance killer here.

For performance reasons, HBase does not maintain a direct row count. To count rows, the HBase shell client has to fetch all the data, which means that if your average row holds 5 MB, the client pulls about 5 MB × 1550 rows ≈ 7.5 GB from the RegionServers just to count them, which is a lot.



To speed things up, you have 2 options:

  • If you want real-time answers, you can maintain your own live row counter using HBase atomic counters: increment it on every insert and decrement it on every delete. It can even live in the same table; just use a different column family to store it (see the shell sketch after this list).

  • If you don't need real-time answers, you can run the distributed RowCounter MapReduce job (source code), forcing the scan to use the smallest column available so the large rows are not read; each RegionServer will read locally stored data and no network I/O is required. In that case you may need to add a new column with a small value to your rows if you don't already have one (a boolean is your best option); see the second sketch below.
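
A minimal shell sketch of the first option, assuming a hypothetical extra column family named meta and a dedicated counter row row_count (both names are illustrative, not from the answer):

# bump the counter on every insert, decrement it on every delete
incr 'tableName', 'row_count', 'meta:count', 1
incr 'tableName', 'row_count', 'meta:count', -1

# read the live total at any time, no scan needed
get_counter 'tableName', 'row_count', 'meta:count'

From application code you would do the same thing with the Increment API of the HBase Java client.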

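For the second option, RowCounter accepts an optional list of columns, so once you have that small column you can restrict the scan to it and avoid dragging the 5 MB content values off disk. The cf:flag qualifier below is a hypothetical name for such a small boolean column:

bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'tableName' 'cf:flag'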