Why can't I add WHERE clauses to Cassandra after filtering the primary key?

Question

EDIT: Thanks for formatting the code, stranger. I'll keep that in mind for the future!

I am following the main Cassandra tutorial from planetcassandra.org and I don't understand why I cannot complete the following query:

select * 
from users 
where lastname = 'Smith' AND city = 'X';

in this table:

CREATE TABLE users 
(
    firstname text,
    lastname text,
    age int,
    email text,
    city text,
    PRIMARY KEY (lastname)
);

As I understand it, the partition key (lastname) partitions the data, so all rows with lastname Smith should be on node X. What's stopping me from filtering those rows further by city?

Thanks!

+3

cassandra nosql datastax

Piedpiper Apr 17 15 at 18:23

2 answers

Short answer

You need to make city a clustering column.

Update: Sorry for the short answer. Let me expand on this a bit.

Cassandra stores data sequentially on disk (a quick dive into the C* read path)
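
As a rough sketch of what that could look like (the table name users_by_lastname and the choice of email as a trailing clustering column are my assumptions, added only so that two Smiths in the same city do not overwrite each other):

```cql
-- Hypothetical variant of the users table from the question.
-- lastname is the partition key; city and email are clustering columns.
CREATE TABLE users_by_lastname (
    lastname text,
    city text,
    email text,
    firstname text,
    age int,
    PRIMARY KEY ((lastname), city, email)
);

-- city is now the first clustering column, so it can be restricted
-- once the partition key is given.
SELECT *
FROM users_by_lastname
WHERE lastname = 'Smith' AND city = 'X';
```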

Cassandra is built from the ground up as a distributed system designed for high performance and availability. Even though CQL resembles SQL, it is limited in the kinds of queries it supports, and you often have to build your data model (and duplicate data) around your query and access patterns.

True, once you specify the partition key in the CQL WHERE clause, Cassandra knows which node is storing your data. However, it still needs to find the data within that node.

Remember that C* stores data sequentially, ordered by the clustering columns. To find the row you're looking for by a column that is not a clustering column, Cassandra would have to do full scans on disk, which are slow once you scale and have a lot of data. If you have clustering columns x, y, and z, the data is sorted by x, then y, then z. This is why you can only add WHERE constraints on x, y, and z in that order.

Check out this data modeling tool to visualize data models at the C* storage layer, see the possible queries, and generate stress YAMLs.

+3

phact Apr 17 15 at 18:39

nickmbailey · Accepted Answer · 2015-04-17T20:21:02+0000

There are two answers to your question here. One is specific to your example, and the other is more general (which is probably what you are really after).

Answer for your example

In your specific example, the primary key is the single column "lastname". That means there is only one row per partition. Every time you write a row with lastname "Smith", you overwrite any previous data in that row. In this case the extra WHERE clause doesn't really make sense, because when you ask for the "Smith" row there will only ever be one result.
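
To make the overwrite concrete, here is a small sketch against the users table from the question (the sample values are made up for illustration):

```cql
-- Both writes target the same primary key (lastname = 'Smith'),
-- so the second one overwrites the columns set by the first.
INSERT INTO users (lastname, firstname, city) VALUES ('Smith', 'John', 'X');
INSERT INTO users (lastname, firstname, city) VALUES ('Smith', 'Jane', 'Y');

-- Only one 'Smith' row exists, carrying the values from the second insert.
SELECT * FROM users WHERE lastname = 'Smith';
```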

More general answer

I assume you meant for your example to allow more than one row per partition. Perhaps something like PRIMARY KEY (lastname, user_id) (or any clustering column that lets you identify individual users with the same last name).

Partitions can be quite large in Cassandra, potentially millions of rows in a single partition. The clustering columns in your primary key determine the order of those rows when they are stored on disk. So when you query on a clustering column, Cassandra can use its knowledge of that ordering to find exactly the data you're looking for.

If Cassandra allowed you to query on columns that are not part of the clustering key, it would need to scan all the data in the partition and check each row against your query. That would be extremely inefficient.

To expand on clustering columns a bit more, the order of the clustering columns also matters. The order determines how rows are stored on disk, as described above. So "PRIMARY KEY (a, b, c)" and "PRIMARY KEY (a, c, b)" are not the same. In the first example, rows are ordered on disk first by column "b", and then all rows with the same value for "b" are ordered by column "c". This means you cannot query for rows within a partition with a specific value for "c" without also specifying "b". That query would again require a scan of the entire partition, since the rows are ordered first by "b".

Knowing the exact queries you want to run will help you determine the clustering key you need, and whether you need to denormalize your data into multiple tables to support multiple queries.
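
Here is a small sketch of that difference. The table names t_abc and t_acb and the value column v are hypothetical, and the exact error message for the rejected query depends on your Cassandra version:

```cql
-- Rows within partition a are sorted by b, then c.
CREATE TABLE t_abc (
    a int, b int, c int, v text,
    PRIMARY KEY (a, b, c)
);

SELECT * FROM t_abc WHERE a = 1 AND b = 2;            -- allowed
SELECT * FROM t_abc WHERE a = 1 AND b = 2 AND c = 3;  -- allowed
-- Rejected (at least without ALLOW FILTERING): c is restricted but b is not.
SELECT * FROM t_abc WHERE a = 1 AND c = 3;

-- Same columns, but rows within partition a are sorted by c, then b.
CREATE TABLE t_acb (
    a int, b int, c int, v text,
    PRIMARY KEY (a, c, b)
);

SELECT * FROM t_acb WHERE a = 1 AND c = 3;            -- allowed
```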
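
As an illustration only, denormalizing for two query patterns might look something like this. The table names, the user_id column, and the sample values are assumptions on my part, not something from your schema:

```cql
-- One table per query pattern; the same user is written to both.
CREATE TABLE users_by_lastname (
    lastname text, user_id uuid, firstname text, city text,
    PRIMARY KEY (lastname, user_id)
);

CREATE TABLE users_by_city (
    city text, user_id uuid, firstname text, lastname text,
    PRIMARY KEY (city, user_id)
);

-- Write to both tables together so they stay in step.
BEGIN BATCH
    INSERT INTO users_by_lastname (lastname, user_id, firstname, city)
    VALUES ('Smith', 123e4567-e89b-12d3-a456-426614174000, 'John', 'X');
    INSERT INTO users_by_city (city, user_id, firstname, lastname)
    VALUES ('X', 123e4567-e89b-12d3-a456-426614174000, 'John', 'Smith');
APPLY BATCH;

-- Each query is served by the table whose partition key matches it.
SELECT * FROM users_by_lastname WHERE lastname = 'Smith';
SELECT * FROM users_by_city WHERE city = 'X';
```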
