Cassandra query flexibility

I am new to big data and am currently stuck on a fundamental design decision.

For a research project, I need to store millions of log entries per minute in my Cassandra-based cluster (a single data center, 4 nodes), which works really well.

Log Entry
------------------------------------------------------------------
| Timestamp              | IP1         | IP2           ... 
------------------------------------------------------------------
| 2015-01-01 01:05:01    | 10.10.10.1  | 192.10.10.1   ...
------------------------------------------------------------------


Each log entry has a specific timestamp, and log entries will initially be requested over different time ranges. As recommended, I started to "model my queries" using a wide-row approach.

Basic C* Schema
------------------------------------------------------------------
| row key              | column key a         | column key b     ... 
------------------------------------------------------------------
|  2015-01-01 01:05    | 2015-01-01 01:05:01  | 2015-01-01 01:05:23
------------------------------------------------------------------


Additional information: column keys are composite (timestamp + UUID) so that they are unique and entries are not overwritten; log entries for a given time are stored together on a node because they share the same partition key.

This way, log entries are grouped into short, minute-wide rows. For example, every log entry for 2015-01-01 01:05 lands in the row keyed by that minute. Queries are then expressed either as range queries over the column keys (using operators such as <), or records are fetched as whole blocks for a specified minute.

Range-based queries achieve decent response times, which is fine for my use case.
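In CQL terms, the model described above might look like the following sketch (the table and column names here are my own assumptions, not taken from the actual setup):

```sql
-- Hypothetical CQL rendering of the minute-bucket / wide-row model.
CREATE TABLE logs_by_minute (
    minute timestamp,   -- partition key: the timestamp truncated to the minute
    ts     timeuuid,    -- clustering key: full timestamp + UUID, avoids overwrites
    ip1    inet,
    ip2    inet,
    PRIMARY KEY (minute, ts)
);

-- Fetch the whole block for one minute:
SELECT * FROM logs_by_minute WHERE minute = '2015-01-01 01:05:00';

-- Or a sub-minute range over the clustering column using <:
SELECT * FROM logs_by_minute
 WHERE minute = '2015-01-01 01:05:00'
   AND ts < maxTimeuuid('2015-01-01 01:05:30');
```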

Q: In the next step, we want to run queries that focus mainly on the IP fields. For example: select all records where IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy.

It is clear that the current model is quite unsuitable for such IP-oriented CQL queries. So the problem is not finding a possible solution, but weighing the different technologies that could solve it:

  • Fix the problem with pure Cassandra (create a second model and maintain the same data in a different, query-oriented form).
  • Add a technology such as Spark ...
  • Switch to an HDFS/Hadoop-based solution (Cassandra + Hadoop) ...
  • etc.

With my lack of knowledge in this area, it is quite difficult to pick the best path, especially considering that a full clustered computing infrastructure might be overkill.



1 answer


As I understand your question, your table schema looks like this:

create table logs (
  minute timestamp,
  id timeuuid,
  ips list<text>,   -- CQL has no "string" type; "text" (or inet) is the valid choice
  message text,
  primary key (minute, id)
);


With this simple schema, you:

  • can fetch all logs for a given minute.
  • can fetch short intra-minute ranges of log events.
  • want to query the dataset by IP.


From my point of view, there are several ways to implement this idea:

  • create a secondary index on the IP addresses. But in C* you then lose the ability to also restrict by timestamp: C* cannot merge primary and secondary indexes the way MySQL/PostgreSQL can.
  • denormalize the data: write each log event to two tables at the same time, one optimized for timestamp queries (minute + ts as PK), the other for IP-based queries (IP + ts as PK).
  • use Spark for analytic queries. But Spark will have to do a (full?) table scan every time it fetches the requested data (a nicely distributed, map-reduced scan, but a table scan nonetheless), so all your queries will take a long time to finish. This method causes problems if you plan to run many low-latency queries.
  • use an external index such as ElasticSearch for querying, and C* for storing the data.
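For illustration, the first option could look like this against the schema above (the index name is my own; on Cassandra 2.1+ an index on a collection column is queried with CONTAINS):

```sql
-- Sketch of option 1: a secondary index on the list of IPs (Cassandra 2.1+).
CREATE INDEX logs_ips_idx ON logs (ips);

-- Query by IP value; note you cannot efficiently combine this with a
-- timestamp range, since C* will not merge the two indexes.
SELECT * FROM logs WHERE ips CONTAINS '10.10.10.1';
```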

In my opinion, the C* way of doing this kind of thing is to maintain a set of separate tables for the different queries. This gives you faster queries at the cost of increased storage.
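The denormalization option could be sketched like this; the second table's name and the use of a logged batch are assumptions on my part, not a definitive implementation:

```sql
-- Sketch of option 2: a second table keyed by IP, written alongside the first.
CREATE TABLE logs_by_ip (
    ip      inet,
    ts      timeuuid,
    message text,
    PRIMARY KEY (ip, ts)
);

-- Write the same event to both tables (a logged batch keeps them in sync):
BEGIN BATCH
  INSERT INTO logs (minute, id, ips, message)
    VALUES ('2015-01-01 01:05:00', now(), ['10.10.10.1', '192.10.10.1'], 'example entry');
  INSERT INTO logs_by_ip (ip, ts, message)
    VALUES ('10.10.10.1', now(), 'example entry');
  INSERT INTO logs_by_ip (ip, ts, message)
    VALUES ('192.10.10.1', now(), 'example entry');
APPLY BATCH;

-- IP-based queries then hit the second table directly:
SELECT * FROM logs_by_ip WHERE ip = '10.10.10.1';
```

The cost is doubled writes and storage, but each query pattern gets its own purpose-built table, which is the usual C* trade-off.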
