Cassandra query flexibility
I am new to big data and am currently stuck on a fundamental design decision.
For a research project, I need to store millions of log entries per minute in my Cassandra-based data center, which works really well (single data center, 4 nodes).
Log Entry
------------------------------------------------------------------
| Timestamp | IP1 | IP2 ...
------------------------------------------------------------------
| 2015-01-01 01:05:01 | 10.10.10.1 | 192.10.10.1 ...
------------------------------------------------------------------
Each log entry has a specific timestamp. In the first instance, log entries will be requested over different time ranges. As recommended, I started to "model around my queries" with a wide-row approach.
Basic C* Schema
------------------------------------------------------------------
| row key | column key a | column key b ...
------------------------------------------------------------------
| 2015-01-01 01:05 | 2015-01-01 01:05:01 | 2015-01-01 01:05:23
------------------------------------------------------------------
Additional information: column keys are composites of timestamp + UUID, to keep them unique and avoid overwrites; log entries of a given time window are stored together on one node because they share the same partition key.
This way, log entries are stored in short, per-row intervals; for example, one row holds every log entry of the minute 2015-01-01 01:05. Queries are now expressed as range queries using the < operator, and records are fetched as blocks of the specified minute. These range-based queries achieve decent response times, which is fine for me.
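For illustration, fetching a sub-range of one minute block on such a model might look like this in CQL (the table and column names are my assumptions, mirroring a minute-bucketed layout, not taken from the question):

```cql
-- Hypothetical minute-bucketed table: partition key = minute, clustering key = timeuuid.
-- A sub-range inside one minute is a range predicate on the clustering column:
SELECT *
FROM logs
WHERE minute = '2015-01-01 01:05:00'
  AND id >= minTimeuuid('2015-01-01 01:05:10')
  AND id <  minTimeuuid('2015-01-01 01:05:30');
```

Because the whole minute lives in one partition, this is a single-node, sequential read.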
Q:
In the next step, we want to answer queries that are mainly focused on the IP fields, for example: select all records where IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy.
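To see why this clashes with the model above: neither IP column is part of the primary key, so C* can only satisfy such a predicate by scanning. A hedged sketch (treating ip1/ip2 as regular columns is my assumption):

```cql
-- Rejected outright without ALLOW FILTERING; with it, C* scans every partition:
SELECT *
FROM logs
WHERE ip1 = '10.10.10.1'
  AND ip2 = '192.10.10.1'
ALLOW FILTERING;
```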
Thus, it is clear that the current model is quite unsuitable for such additional IP-oriented CQL queries. So the problem is not finding a possible solution, but choosing among the different technologies that might fit:
- Try to solve the problem with a standalone C* solution (create a second model and maintain the same data in a different form).
- Choose additional technologies like Spark ...
- Switch to an HDFS/Hadoop or Cassandra/Hadoop solution ...
- etc.
With my lack of knowledge in this area, it is quite difficult to pick the best way forward, especially considering that a dedicated cluster-computing infrastructure might be overkill.
As I understood your question, your table schema looks like this:
create table logs (
    minute timestamp,
    id timeuuid,
    ips list<text>,
    message text,
    primary key (minute, id)
);
With this simple schema, you:
- can retrieve all logs for a certain minute.
- can retrieve short intra-minute ranges of log events.
- want to query the dataset by IP.
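Concretely, the first two points map to cheap single-partition queries against this schema, while the third has no efficient path; for example (values illustrative):

```cql
-- all logs of one minute: a single-partition read
SELECT * FROM logs WHERE minute = '2015-01-01 01:05:00';

-- a short range inside that minute, via the clustering column
SELECT * FROM logs
WHERE minute = '2015-01-01 01:05:00'
  AND id > maxTimeuuid('2015-01-01 01:05:30');
```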
From my point of view, there are several ways to implement this idea:
- create a secondary index on the IP addresses. But in C* you will lose the ability to also restrict by timestamp: C* cannot merge primary and secondary indexes the way MySQL/PostgreSQL can.
- denormalize the data: write each log event to two tables at the same time, the first optimized for timestamp queries (minute + ts as PK), the second for IP-based queries (IP + ts as PK).
- use Spark for analytic queries. But Spark will have to do a (full?) table scan every time it fetches the requested data (a nicely distributed, parallel scan, but a table scan nonetheless), so all your queries will take a long time to finish. This approach can cause problems if you plan on having many low-latency requests.
- use an external index such as Elasticsearch for querying, and C* for storing the data.
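The denormalization option could be sketched as follows; the table name, column names, and the batch are illustrative assumptions, not part of the question:

```cql
-- Second, IP-optimized copy of the same events (ip = partition key):
CREATE TABLE logs_by_ip (
    ip inet,
    ts timeuuid,
    minute timestamp,
    message text,
    PRIMARY KEY (ip, ts)
);

-- The application writes each event to both tables, e.g. in one batch:
BEGIN BATCH
  INSERT INTO logs (minute, id, ips, message)
    VALUES ('2015-01-01 01:05:00', now(), ['10.10.10.1'], 'log line');
  INSERT INTO logs_by_ip (ip, ts, minute, message)
    VALUES ('10.10.10.1', now(), '2015-01-01 01:05:00', 'log line');
APPLY BATCH;

-- An IP-based lookup then becomes a single-partition read:
SELECT * FROM logs_by_ip WHERE ip = '10.10.10.1';
```

Note that a batch spanning two partitions trades write latency for atomicity; two separate asynchronous writes would also work here if occasional divergence between the tables is acceptable.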
In my opinion, the C* way of doing this kind of thing is to maintain a set of separate tables, one per query pattern. This gives you faster queries, at the cost of increased storage.