Count distinct values

Consider the following table:

CREATE TABLE users (
  date timestamp,
  user_id text,
  PRIMARY KEY (date, user_id)
);

      

With the following data, for example:

date       user_id

25Aug2013    1
25Aug2013    2
25Aug2013    1
25Aug2013    3

26Aug2013    1
26Aug2013    2

27Aug2013    2
27Aug2013    3
27Aug2013    4

28Aug2013    1
28Aug2013    2
28Aug2013    1
28Aug2013    3

      

How can I count the number of unique user_ids?



2 answers


One idea is to use a set collection:

CREATE TABLE stats_unique (
  stat_group text,
  user_ids set<text>,
  PRIMARY KEY (stat_group)
);

      

Inserts automatically remove duplicates from the collection, and a select fetches all the IDs at once, so you do the counting at the application level.
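
As a rough illustration of counting at the application level, here is a minimal Python sketch. It assumes `session` is an already connected Session from the DataStax Python driver (or any object with a compatible `execute()` method) and that the table is the set-based stats_unique above. The helper names `record_user` and `count_unique_users` are made up for this example.

```python
# Sketch only: `session` is assumed to be a connected cassandra-driver Session
# (or any object with a compatible execute() method); it is not created here.

ADD_USER_CQL = "UPDATE stats_unique SET user_ids = user_ids + %s WHERE stat_group = %s"
GET_USERS_CQL = "SELECT user_ids FROM stats_unique WHERE stat_group = %s"


def record_user(session, stat_group, user_id):
    """Add one user_id to the set; re-adding an existing id changes nothing."""
    # A one-element Python set is sent as a CQL set literal, e.g. {'1'}.
    session.execute(ADD_USER_CQL, ({user_id}, stat_group))


def count_unique_users(session, stat_group):
    """Fetch the whole set and count its members on the client side."""
    rows = list(session.execute(GET_USERS_CQL, (stat_group,)))
    if not rows or rows[0].user_ids is None:
        return 0  # no row yet, or an empty set (returned as null)
    return len(rows[0].user_ids)
```

Note that the whole set travels to the client on every count, which is one of the costs of this approach.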



If you're only interested in the number of unique user_ids without actually fetching them from disk, I'm afraid you'll have to tweak your application code a bit.

And don't forget to study the limitations of collections in detail.



In the comments I already mentioned some material related to the question, but I would like to add an answer as well.

Personally, when I was in a similar situation with Cassandra, I abused some of its properties. It is kind of a hack, but I figured it might be useful in this context.

Basically, I created a single side table into which I put all the unique values, i.e.:

CREATE TABLE stats_unique (
  stat_group text,
  user_id text,
  PRIMARY KEY (stat_group, user_id)
);

      

Writes are usually cheap, and I had no problem with a few extra simple ones, because Cassandra was built for that. So every time I insert into the base table, I also insert into the stats_unique table. For your example, it would be something like:

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '3');

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '3');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '4');

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '3');
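
On the application side, the double write could look roughly like the following Python sketch. It assumes `session` is an already connected Session from the DataStax Python driver (or a compatible object); the helper name `insert_user` is hypothetical, and the two writes are deliberately left as independent statements, so they are not atomic.

```python
# Sketch only: `session` is assumed to be a connected cassandra-driver Session.

BASE_INSERT_CQL = "INSERT INTO users (date, user_id) VALUES (%s, %s)"
UNIQUE_INSERT_CQL = "INSERT INTO stats_unique (stat_group, user_id) VALUES (%s, %s)"


def insert_user(session, date, user_id, stat_group="users"):
    """Write to the base table and mirror the user_id into the side table."""
    session.execute(BASE_INSERT_CQL, (date, user_id))
    # Inserting an existing (stat_group, user_id) key simply overwrites the
    # row, so the side table keeps exactly one row per unique user_id.
    session.execute(UNIQUE_INSERT_CQL, (stat_group, user_id))
```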

And then when I needed uniques, I just wrote a simple query:

SELECT COUNT(1) FROM stats_unique WHERE stat_group = 'users';

 count
-------
     4

(1 rows)

      

This is by no means a standard solution, but it worked for my particular case. Please note that I could not hold more than two million entries in this single partition, but the system simply did not need to support that many entities, so it was good enough for my use. Also, with this hack you might run into issues such as timeouts when counting.

It would be better to have something on the side do this: a separate process, a script, or even, as Ashrafi Islam outlined in his comment, a Spark job that computes the count for you and writes it to another table in Cassandra or to some other storage technology.

What I used might be a Cassandra anti-pattern (hot spots, etc.), but it worked for me.
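
As a minimal sketch of such a side script (not the Spark variant), the following Python function scans the base table, counts the distinct user_ids on the client, and stores the result. It assumes `session` is an already connected Session from the DataStax Python driver; the stats_counts table and the name `refresh_unique_count` are hypothetical and only serve the illustration.

```python
# Sketch only: `session` is assumed to be a connected cassandra-driver Session.
# The stats_counts table is hypothetical, for example:
#   CREATE TABLE stats_counts (stat_group text PRIMARY KEY, unique_count bigint);

SCAN_CQL = "SELECT user_id FROM users"
STORE_CQL = "INSERT INTO stats_counts (stat_group, unique_count) VALUES (%s, %s)"


def refresh_unique_count(session, stat_group="users"):
    """Scan the base table, count distinct user_ids, and store the result."""
    seen = set()
    for row in session.execute(SCAN_CQL):  # full table scan; slow on big tables
        seen.add(row.user_id)
    session.execute(STORE_CQL, (stat_group, len(seen)))
    return len(seen)
```

Run periodically, this keeps the expensive scan out of the request path, at the price of a count that can be slightly stale.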
