Order by any field in Kassandra

I am looking into cassandra as a possible solution for my upcoming project. The more I research, the more I hear that it is a bad idea to sort fields that are not customizable to sort when the table is created.

Can I sort by any field? If the impact of sorting on non-cluster fields affects performance, what affects performance? I need to sort about 2 million records in a table.

+3


source to share


1 answer


I keep hearing that it's a good idea to sort fields that are not customizable to sort when the table is created.

It is not that this is a bad idea. It's just not possible to make Cassandra sort your data with an arbitrary column. Cassandra requires a query based approach and this also applies to sort order. You must decide ahead of time which queries you want to support Cassandra, and the order in which those queries return their data.

Can I sort by any field?

Here's how Cassandra sorts the result sets: it doesn't. Cassandra queries correspond to the locations of the partitions and the data is read from disk and returned to you. If the data is read in the same order as the sort on disk, the result set will be sorted. On the other hand, if you try a multi-key query, or an index based query where it has to jump to different partitions, it might not be returned in any meaningful order.

But if you plan ahead, you can actually influence the sort order on disk of your data and then use that order in your queries. This can be done using a modeling engine called a "clustering column". Cassandra will let you specify multiple clustering columns, but they are only valid in one section.

So what does this mean? Take this example from DataStax documentation .

CREATE TABLE playlists (
  id uuid,
  artist text,
  album text,
  title text,
  song_order int,
  song_id uuid,
  PRIMARY KEY ((id),song_order))
WITH CLUSTERING ORDER BY (song_order ASC);

      

With this table definition can I request specific playlist

to id

(key section). Within each id

data will be returned by order song_order

:

SELECT id, song_order, album, artist, title 
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order DESC;

id                                   | song_order | album                 | artist         | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 4          | No One Rides For Free |      Fu Manchu |             Ojo Rojo    
62c36092-82a1-3a00-93d1-46196ee77204 | 3          |             Roll Away | Back Door Slam |  Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 2          |          We Must Obey |      Fu Manchu |     Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1          |          Tres Hombres |         ZZ Top |            La Grange

      



In this example, if I need to specify ORDER BY

if I want to switch the sorting direction. Since the strings are stored in ASC

final order, I need to specify DESC

to see them in DESC

final order. If I was fine with returning rows in trailing order ASC

, I don't need to specify at all ORDER BY

.

But what if I want to order by an artist? Or an album? Or both? Since a single artist can have many albums (for this example), we will change the definition of PRIMARY KEY as follows:

PRIMARY KEY ((id),artist,album,song_order)

      

Doing the same query above (minus ORDER BY

) produces this output:

SELECT id, song_order, album, artist, title 
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

id                                   | song_order | album                 | artist         | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 3          |             Roll Away | Back Door Slam |  Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 4          | No One Rides For Free |      Fu Manchu |             Ojo Rojo    
62c36092-82a1-3a00-93d1-46196ee77204 | 2          |          We Must Obey |      Fu Manchu |     Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1          |          Tres Hombres |         ZZ Top |            La Grange

      

Note that now the lines are ordered artist

and then album

. If we had two songs from the same album, then it song_order

will be next.

So now you may be asking, "What if I just want to sort by album

, not artist

?" You can only sort album

, but not with this table. You cannot omit clustering keys in the ORDER BY clause. To sort only album

(not artist

) you need to create another query table. Sometimes Cassandra data modeling will allow you to duplicate your data multiple times to be able to serve different requests ... and that's okay.

For more details on how to create data models while using clustering order at the same time, see these two articles on PlanetCassandra :

+8


source







All Articles