Efficient access to ordered results in Cassandra

I am trying to translate a fairly general SQL requirement into an efficient data model in Cassandra. I am trying to figure out the best way to model my data so that Cassandra returns my rows in the order in which I need to present them in the application. Normally this would be a good fit for a clustering column, except that the value I want to order my results on is a metric that is updated multiple times a day.

I'm going to explain the problem in SQL terms and then share the modeling approaches that have occurred to me. What I would like to know is whether anyone has come across a requirement similar to mine, and if so, how you modeled the data in Cassandra.

Here is the problem I am trying to solve.

Suppose I have a raw_data table defined like this:

CREATE TABLE raw_data (
  A varchar,
  B varchar,
  C varchar,
  D varchar,
  ts timestamp,
  val varint,
  PRIMARY KEY (ts, A, B, C, D)
);

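For concreteness, a row lands in this table roughly like so (the values below are made up purely for illustration):

-- made-up sample values, one raw measurement
INSERT INTO raw_data (ts, A, B, C, D, val)
VALUES ('2015-01-05 12:00:00+0000', 'Something', 'b1', 'c1', 'd1', 42);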

And I also have a pivot table:

CREATE TABLE summary_table (
  A varchar,
  B varchar,
  C varchar,
  total_val varint,
  PRIMARY KEY (A, B, C)
);


My pivot table's data is aggregated by my application in a way that matches this query:

SELECT A, B, C, SUM(val) FROM raw_data GROUP BY A, B, C

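The rollup itself is done by my application, which upserts the result back into the pivot table, roughly like this (illustrative values again):

-- the application computes SUM(val) per (A, B, C) and upserts it
INSERT INTO summary_table (A, B, C, total_val)
VALUES ('Something', 'b1', 'c1', 98765);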

What I want to do is execute a query like:

SELECT B, C, total_val FROM summary_table WHERE A = 'Something' ORDER BY total_val DESC LIMIT 1000;


That is, I want to query the pivot table for a specific value of A and then return the top 1000 rows, ordered by total_val.

total_val is updated every few minutes by my application as additional data is pushed into my raw_data table, so I cannot use total_val as a clustering column.

I'm trying to figure out the best way to model this type of problem in Cassandra: one in which I need to query a pivot table with a WHERE clause and order the result set (which is constantly being updated) in descending order.

Some of the result sets can be quite large, a few hundred thousand rows. That is, there are some values of A in my pivot table for which SELECT COUNT(*) FROM summary_table WHERE A = 'some value' will be very, very large, in the hundreds of thousands. Obviously it is inefficient to sort all of that data and throw most of it away before sending the top rows to my application.

Secondary indexes do not seem like a good option either. They are very effective on smaller result sets, but on the larger ones they lag badly, and I suspect there is a better way to deal with this problem.

Another modeling approach I looked at is caching the large result sets in memory, so that where I need to sort many thousands of rows I would at least be doing it in memory. I also considered keeping a second summary table already populated with the top 1000 rows I want to serve to my application, although I cannot think of a good way to keep that data up to date without running into exactly the same problem I have with my original pivot table.

Has anyone come across a problem like this, where you need to filter pivot data with a WHERE clause and order your (frequently changing) results in descending order? If so, have you found a way to do it when certain WHERE clauses return many thousands of rows? How did you do it?


1 answer


The best way I can think of to do this is:

CREATE TABLE summary_table (
  time_bucket bigint,
  A varchar,
  total_val int,
  timestamp bigint,
  B varchar,
  C varchar,
  PRIMARY KEY ((time_bucket, A), total_val, timestamp, B, C)
) WITH CLUSTERING ORDER BY (total_val DESC);

In this structure, you never actually overwrite total_val. Instead, you insert a new row for each new value and discard everything but the latest timestamp at query time. The time_bucket value should be your timestamp rounded down to some interval that you can calculate at request time (you may have to query a couple of buckets at once, but try to limit it to two if possible). Together, time_bucket and A become your partition key, which prevents the partition from growing indefinitely over time.
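A rough sketch of the write path, assuming hourly buckets and epoch-millisecond timestamps (both of those are just assumptions to make the example concrete):

-- Every recomputation of the aggregate inserts a fresh row instead of
-- overwriting the old one. time_bucket is the update time rounded down
-- to the hour; timestamp is the exact update time.
INSERT INTO summary_table (time_bucket, A, total_val, timestamp, B, C)
VALUES (1420459200000, 'Something', 12345, 1420460523000, 'b1', 'c1')
USING TTL 172800;  -- optional: let old rows expire after two days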

In other words, you have turned your pivot table into time series data. If necessary, you can add a TTL to the old rows so that they die off naturally. As long as your time buckets are reasonably sized, you will not run into the problem of querying through a large number of tombstones.
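The read side is then something like this (again only a sketch; computing the current bucket and dropping stale duplicates happens in the application):

-- Rows come back already ordered by total_val DESC thanks to the clustering order.
-- The application calculates the current time_bucket (and may also query the
-- previous bucket), then keeps only the newest timestamp per (B, C) before
-- taking its top 1000.
SELECT B, C, total_val, timestamp
FROM summary_table
WHERE time_bucket = 1420459200000 AND A = 'Something'
LIMIT 2000;  -- some headroom, since stale duplicates get filtered out client-side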
