Cassandra schema for hourly queries

I want to store data in Cassandra from many sources and run a job once an hour that processes only the data points from a specific hour. What's the best schema for this?

To avoid hotspots, I can't fit all of an hour's data into one partition, so each hour's data has to be spread across many partitions.

So I see two ways to support querying by hour:

  • Create a new table for each hour and do a SELECT * without a WHERE clause to read that hour. I think this would be read- and write-efficient, but managing so many tables would be very painful.

  • Create a new table every week with a column for the hour of the week (for example, 1 to 168) and put a secondary index on it. Then I can do SELECT * WHERE hour = x. This seems workable, but I'm worried it won't scale well once there are many rows.

Does anyone know which approach would scale better? Is there a better way to do this?

Thanks.


2 answers


In situations like this, you can use buckets. A bucket is a way of splitting a partition into several separate pieces. For example, imagine your schema looks like this:

```
CREATE KEYSPACE timeseries WITH replication = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 1
};

USE timeseries;

CREATE TABLE hourly (
  source_id text,
  hour text,
  date timestamp,
  data text,
  bucket int,
  PRIMARY KEY ((hour, bucket), date)
);
```

You can then use the bucket column to split each hour into, say, 10 pieces, using a hash function of some known identifier (for example, source_id).

At query time you need to specify the hour and, usually, all the buckets:

```
SELECT * FROM hourly
WHERE hour = '2015-07-20 23:00'
AND bucket IN (0,1,2,3,4,5,6,7,8,9);
```
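As a minimal sketch, the read side could assemble that statement programmatically instead of hard-coding the bucket list (the `hourly_query` helper and the bucket count of 10 are my assumptions, not from the answer):

```python
# Hypothetical helper: build the hourly read query across every bucket.
# Table and column names follow the schema above; NUM_BUCKETS = 10 is an assumption.
NUM_BUCKETS = 10

def hourly_query(hour: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Build a SELECT that reads one hour's data from all buckets."""
    buckets = ",".join(str(b) for b in range(num_buckets))
    return (f"SELECT * FROM hourly "
            f"WHERE hour = '{hour}' AND bucket IN ({buckets});")

print(hourly_query('2015-07-20 23:00'))
```

Note that with a large bucket count, many drivers do better issuing one query per bucket (ideally asynchronously) rather than one big IN clause, since the IN fans out to multiple partitions anyway.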



The hash function matters: you want it to spread the data evenly across the partitions even when the hashed identifiers themselves are not evenly distributed, but you also don't want it to be overly complex.

This JSFiddle shows a hashing example that is very simple, distributes data evenly, and can easily be reproduced in any language: http://jsfiddle.net/joscas/yfp72fq5/
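I can't vouch for what the JSFiddle implements exactly, but a hash with those properties can be sketched as, for example, the sum of the id's UTF-8 bytes modulo the bucket count. The `bucket_for` name and the bucket count of 10 are illustrative assumptions:

```python
NUM_BUCKETS = 10  # assumption: 10 buckets, matching the query above

def bucket_for(source_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Very simple, language-portable hash: sum of UTF-8 bytes mod bucket count."""
    return sum(source_id.encode("utf-8")) % num_buckets

# Deterministic: the same id always lands in the same bucket,
# so reads and writes agree on where a source's rows live.
assert bucket_for("sensor-1") == bucket_for("sensor-1")
```

Because it is plain byte arithmetic, the same function is trivial to re-implement in Java, JavaScript, or whatever else writes to the cluster.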

Otherwise, taking your id modulo the bucket count, or even the epoch time modulo the bucket count, may suffice instead of a hash function. But if you use modulo on an identifier, you should check that the ids don't follow a pattern that lands them all in the same few buckets. And if you take the timestamp modulo the bucket count, you will effectively write everything into a single bucket for a stretch of time, which can create hotspots, especially if the number of buckets is small.



There aren't many options and, as you've already figured out, each solution has its drawbacks.

I would definitely avoid solution number 2 because of the scalability issues with secondary indexes. If you need a solution right now, I would use many tables. If you can wait, I would use Cassandra 3 and materialized views, choosing an appropriate key.



HTH, Carlo
