Cassandra performance: SELECT by id or SELECT with no filter
I am wondering whether the speed of a Cassandra (C*) SELECT depends on how we select an entire finite table.
For example, we have this table:
id | value
A | x
A | xx
B | xx
C | xxx
B | xx
Would it be faster to get all the results if we do
SELECT * FROM Y WHERE id='A'
SELECT * FROM Y WHERE id='B'
SELECT * FROM Y WHERE id='C'
or would it be faster if we do
SELECT * FROM Y WHERE 1
or would it maybe be faster if we do
SELECT * FROM Y WHERE id IN ('A', 'B', 'C')
Or would they all be equally fast (if we ignore the connection time)?
I'm not sure what your column family (table) definition looks like, but your sample data could never exist in Cassandra as shown. Primary keys are unique, and if id is your primary key, the last write for each id wins. The table would basically end up looking something like this:
id | value
A | xx
C | xxx
B | xx
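To illustrate why the five sample rows collapse to three, here is a minimal sketch in Python. It is a toy model only, not Cassandra code: a plain dict keyed by id stands in for a table whose only primary key column is id, and the function name apply_writes is made up for this example. Real Cassandra returns rows in token order, which this sketch does not model.

```python
# Toy model only: a dict keyed by primary key stands in for a table
# whose sole primary key column is `id`. This is not Cassandra code.

def apply_writes(writes):
    """Apply (id, value) writes in order; a later write to an id replaces the earlier one."""
    table = {}
    for row_id, value in writes:
        table[row_id] = value  # same key: overwrite, never a second row
    return table

# The five rows from the question, in the order they were listed.
writes = [("A", "x"), ("A", "xx"), ("B", "xx"), ("C", "xxx"), ("B", "xx")]
table = apply_writes(writes)

for row_id, value in table.items():
    print(row_id, "|", value)
```

If you really need several values per id, the usual approach is to add a clustering column to the primary key so that each (id, clustering value) pair is its own row.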
As for your individual requests ...
SELECT * FROM Y WHERE 1
This might work well with 3 rows, but it won't when you have 3 million, all spread across multiple nodes.
SELECT * FROM Y WHERE id IN ('A', 'B', 'C')
This is definitely not faster. See my answer here on why relying on IN for anything other than occasional OLAP usage is not a good idea.
SELECT * FROM Y WHERE id='A'
SELECT * FROM Y WHERE id='B'
SELECT * FROM Y WHERE id='C'
This is definitely the best way. Cassandra is designed to be queried by a specific, unique partition key. Even if you want to query every row in a column family (table), you are still giving it a specific partition key each time. This helps your driver quickly determine which node(s) to send the query to.
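The routing idea can be sketched with a toy hash ring in Python. Everything here is an assumption made for illustration: the three node names, the 32-bit token space, and the use of MD5 as a stand-in hash (Cassandra's default partitioner is Murmur3, with a much larger token range). Replication is ignored, so each key has exactly one owner.

```python
import hashlib
from bisect import bisect_left

# Hypothetical three-node ring: each node owns the token range ending at its token.
RING = [(0x55555555, "node1"), (0xAAAAAAAA, "node2"), (0xFFFFFFFF, "node3")]
TOKENS = [token for token, _ in RING]

def token_for(partition_key: str) -> int:
    """Hash a partition key to a 32-bit token (stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def owner_of(partition_key: str) -> str:
    """Return the node whose token range contains the key's token."""
    index = bisect_left(TOKENS, token_for(partition_key))
    return RING[index][1]

def nodes_to_contact(partition_key=None):
    """With a partition key, one node is enough; without one, every node may hold rows."""
    if partition_key is None:
        return sorted(name for _, name in RING)
    return [owner_of(partition_key)]

for key in ("A", "B", "C"):
    print(key, "->", nodes_to_contact(key))
print("no key ->", nodes_to_contact())
```

With a key, the client can compute the owner locally and go straight to it. Without a key, there is nothing to hash, so the request has to fan out across the ring.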
Now let's say you have 3 million rows. For your application, is it faster to query each one individually, or to just do a SELECT *? It might be faster from the query's point of view, but you would still have to iterate through every row (client side). That means managing all of them within the limits of your JVM's available memory (which probably means paging through them in some way). But this is a bad (extreme) example, because you should never want to send your client application 3 million rows to work with.
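As a rough sketch of what paging buys you on the client side, the Python below processes rows one page at a time so that only a single page is held in memory. It is a pure-Python model, not driver code: fetch_pages, count_rows, and the fetch size of 5000 are all made up for this example, and the rows come from a generator rather than a cluster. Real drivers generally handle paging for you through a configurable fetch size.

```python
from itertools import islice

def fetch_pages(rows, fetch_size):
    """Yield lists of at most fetch_size rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        page = list(islice(iterator, fetch_size))
        if not page:
            return
        yield page

def count_rows(rows, fetch_size=5000):
    """Walk every row page by page; return (total rows, largest page held at once)."""
    total = 0
    largest_page = 0
    for page in fetch_pages(rows, fetch_size):
        total += len(page)
        largest_page = max(largest_page, len(page))
    return total, largest_page

# Stand-in for a 3-million-row result set, produced lazily.
fake_rows = ((i, "value") for i in range(3_000_000))
total, largest_page = count_rows(fake_rows)
print(total, "rows seen,", largest_page, "rows held at most")
```

Paging bounds memory, but the client still touches every row, which is why the full scan remains the expensive option.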
The bottom line is that you will have to work through these trade-offs yourself, within the requirements of your application. But from a performance standpoint, I've noticed that appropriate query-based data modeling tends to outweigh query strategy or syntax tricks.