Strange Cassandra ReadTimeoutExceptions, depending on which client is querying

I have a cluster of 3 Cassandra nodes with more or less default configuration. I also have a web tier of two load-balanced nodes, and both websites query Cassandra constantly. After some time, once the amount of data stored in Cassandra became non-trivial, one and only one of the websites started getting ReadTimeoutException for a specific query. The websites are identical in every respect.

The query is very simple (? is a placeholder for a date, usually a few minutes before the current moment):

SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;

      

The table is created using this query:

CREATE TABLE table (
    user_id varchar,
    article_id varchar,
    time timestamp,
    PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);

      

When it times out, the client waits for a bit more than 10 seconds, which, not surprisingly, is the timeout configured in cassandra.yaml for most connections and reads.

There are several things that puzzle me:

  • the query only times out when one of the websites executes it: one of the websites always fails, the other always succeeds
  • the query returns instantly when I run it from cqlsh (although it seems to hit only one node when I run it from there)
  • there are other queries being issued that take 2-3 minutes (much longer than the 10s timeout) and never time out

I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I would prefer not to change the Cassandra timeouts, as this is a production system and I'd like to exhaust non-invasive options first. The Cassandra nodes have plenty of heap, the heap is far from full, and GC times seem normal.

Any ideas or directions would be much appreciated; I'm totally out of ideas. The Cassandra version is 2.0.2, using the com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.



1 answer


A few things I noticed:

  • While you are using time as a clustering key, it isn't really helping you, because your query does not restrict by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now, your query returns the first row that satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, I would expect this query to return data from the same user_id (or the same select few) every time.

  • "although it seems to hit only one node when I run it from there." Actually, you want your queries to hit only one node when they run. Introducing extra network traffic into a query makes it much slower. I believe the default consistency level in cqlsh is ONE. This is where Carlo's suggestion comes into play.

  • What is the cardinality of article_id? Remember that secondary indexes perform best on "middle of the road" cardinality. High (unique) and low (boolean) are both bad.

  • As a rule, the ALLOW FILTERING clause should not be used in (production) application code. If you have 50 million rows in this table, then ALLOW FILTERING first pulls all of them back and only then trims the result set down based on your WHERE clause.
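To make the first note concrete, here is a minimal toy model in plain Java, not driver code. It walks partitions in token order and rows in time order within each partition, which is the ordering described above. The class name, the method names, and the use of String.hashCode() as the "token" are all assumptions made for illustration; Cassandra's real partitioner uses a different hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of "partitions in token order, rows in clustering order".
final class TokenOrderSketch {

    // Placeholder for the partitioner's token function (NOT Cassandra's real hash).
    static int token(String userId) {
        return userId.hashCode();
    }

    // Returns "user_id@time" for the first row with time > after, scanning
    // partitions by token and rows by ascending time; null if nothing matches.
    static String firstMatch(Map<String, List<Long>> timesByUser, long after) {
        List<String> users = new ArrayList<>(timesByUser.keySet());
        users.sort((a, b) -> {
            int c = Integer.compare(token(a), token(b));
            return c != 0 ? c : a.compareTo(b); // tie-break so equal tokens stay distinct
        });
        for (String user : users) {
            List<Long> times = new ArrayList<>(timesByUser.get(user));
            Collections.sort(times); // clustering order inside one partition
            for (long t : times) {
                if (t > after) {
                    return user + "@" + t;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> rows = new HashMap<>();
        rows.put("b", Arrays.asList(100L, 200L));
        rows.put("a", Arrays.asList(500L));
        // "a" has the smaller placeholder token, so its row wins even though
        // b@100 is the earliest row that matches time > 50.
        System.out.println(firstMatch(rows, 50L)); // prints a@500
    }
}
```

The point of the sketch is only that LIMIT 1 without a partition restriction hands back whichever matching row comes first in token order, so the same few user_id values tend to show up on every call.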

Suggestions:

  • Carlo may be on to something with his suggestion to try a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.

  • Either run an ALLOW FILTERING query or a secondary index query. They both suck, but definitely do not do both together. I wouldn't use either. But if I had to choose, I would expect the secondary index query to suck less than the ALLOW FILTERING query.

  • To solve this problem at the scale you describe, I would duplicate the data into a query table. It sounds like you are interested in organizing time-sensitive data and getting the most recent data back. A query table like this should do it:

    CREATE TABLE tablebydaybucket (
        user_id varchar,
        article_id varchar,
        time timestamp,
        day_bucket varchar,
        PRIMARY KEY (day_bucket, time))
    WITH CLUSTERING ORDER BY (time DESC);



Populate this table with your data and then this query will work:

SELECT * FROM tablebydaybucket 
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;

      

This will partition your data by day_bucket and cluster your data by time. This way, you won't need ALLOW FILTERING or a secondary index. Your query is also guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after the fact. And clustering on time in DESCending order will help your most recent rows come back faster.
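The application has to compute day_bucket itself, both when writing a row and when querying. Below is a minimal sketch of how that could look in plain Java, assuming Java 8+ (for java.time) and assuming buckets are cut at UTC midnight; the DayBuckets class and its method names are hypothetical. Whatever zone you pick, writers and readers must agree on it.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for deriving day_bucket values such as "20150519".
final class DayBuckets {

    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Bucket for one instant (assumption: buckets are UTC calendar days).
    static String bucketOf(Instant t) {
        return FMT.format(t.atZone(ZoneOffset.UTC).toLocalDate());
    }

    // Buckets that can contain rows with time > since, newest bucket first.
    // A "last few minutes" query needs two buckets right after midnight.
    static List<String> bucketsSince(Instant since, Instant now) {
        List<String> buckets = new ArrayList<>();
        LocalDate oldest = since.atZone(ZoneOffset.UTC).toLocalDate();
        LocalDate day = now.atZone(ZoneOffset.UTC).toLocalDate();
        while (!day.isBefore(oldest)) {
            buckets.add(FMT.format(day));
            day = day.minusDays(1);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // 2015-05-19 15:38:49-0500 from the query above is 20:38:49 UTC.
        System.out.println(bucketOf(Instant.parse("2015-05-19T20:38:49Z"))); // prints 20150519
    }
}
```

With a helper like this, the application would run the SELECT once per bucket returned by bucketsSince, newest first, and stop at the first bucket that returns a row.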
