Cassandra Schema Design and Backup Strategy

We are using Cassandra for an IoT-based application and currently receive about 10 GB of data daily. We store all of it in a single Cassandra table modeled as a time series. Is it better to keep the data in one table, or to split it into multiple tables (per year, per month)?

Schema:

CREATE TABLE SensorData (
    cid text,
    event_date date,
    event_time timestamp,
    data text,
    device_id text,
    device_type text,
    rawdata text,
    PRIMARY KEY ((cid, event_date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

Spark job requirements:

  • Spark jobs need to run for a single client for one date (see the example queries after this list).
  • Spark jobs need to run for a single client over a range of dates (the application currently uses ALLOW FILTERING for this).
  • Spark jobs need to process the full history of an individual client.
  • Spark jobs need to process the data of all clients.

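For illustration, the first two requirements translate into queries roughly like these against the schema above (the client id and dates are placeholders):

-- one client, one date: a single-partition read
SELECT * FROM SensorData
WHERE cid = 'client-1' AND event_date = '2020-01-01';

-- one client, several dates: still partition reads if the dates are enumerated,
-- otherwise the application falls back to ALLOW FILTERING
SELECT * FROM SensorData
WHERE cid = 'client-1' AND event_date IN ('2020-01-01', '2020-01-02', '2020-01-03');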
As the data grows, these jobs slow down. Should we focus on tuning performance (Cassandra / Spark), or should we split the data into multiple smaller tables?

Backup strategy

What is the best way to make a backup strategy?

1 answer


From what I can tell, your schema looks fine. If in the future you receive messages at millisecond rates, you might want to bucket at a finer granularity than the daily level you have now.

But daily buckets are probably fine, since sensors rarely send data more often than once per second. I even worked on one project where we bucketed by month while the data arrived every second, and it was not a big deal. So schema-wise you are ok.

As for the Spark jobs, considering this schema:

  • Job 1 is ok, since you can read all the data for a day without too many problems.
  • For job 2 I would avoid ALLOW FILTERING, especially at 10 GB per day; it will only get worse over time. If you give some details on why you need the filtering I might be able to help, but my honest advice is to avoid it altogether.
  • Job 3 requires you to iterate over the date partitions. My best advice is to simply walk back through the history day by day. You also need a sensible termination condition: either fixed for all clients (i.e. never go back further than x months), or something smarter, e.g. when processing a client's "whole" history you stop once, say, 10 consecutive daily buckets come back empty. That can be tricky, though, since some clients have longer outages. Either way, make it configurable.
  • Job 4 can be heavy, but if you are using Spark it shouldn't be a problem.

With Cassandra it's better to prepare the data ahead of time for the queries you need. So with your schema, jobs 1 and 2 are fine and 3 is also doable, but 4 is always difficult. By design, if you add 10 GB to the data set every day and you want to process all of it, the job will take longer every day. You can't really avoid that if you want all the data.

Usually in this situation you run some kind of ETL that pre-computes the sums, averages and whatever other aggregates you need for certain units of time. I.e. if your report is for a whole day, you write new entries into Cassandra for that day and save the result, so you don't have to reprocess the raw data every time. So your problem is not a few smaller tables, but how you design your ETL operations.
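
As an illustration, a minimal sketch of what such a pre-aggregated table could look like; the table name, the aggregate columns and the per-device partitioning are assumptions, not something from your current model:

-- hypothetical daily rollup written by the ETL job
CREATE TABLE SensorDataDaily (
    cid text,
    device_id text,
    event_date date,
    reading_count bigint,   -- number of readings received that day
    reading_sum double,     -- sum of the numeric reading, so averages can be derived
    reading_min double,
    reading_max double,
    PRIMARY KEY ((cid, device_id), event_date)
) WITH CLUSTERING ORDER BY (event_date DESC);

A report for one device and one day is then a single-row read instead of a scan over all the raw readings.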

For backups, I would suggest the standard Cassandra snapshot tooling. What you linked to works fine; I never had any problem with it. I have also written tools that exported data to CSV, but that was more for other clients and companies who wanted to do their own processing of the data we had.
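
If it helps, a typical snapshot run looks roughly like this (the keyspace name and tag are placeholders); nodetool snapshot creates hard links under each table's snapshots directory, which you then copy off the node with your own tooling:

# take a named snapshot of the keyspace that holds the sensor data
nodetool snapshot -t daily_backup sensor_keyspace

# after the snapshot files have been copied off the node, clean up
nodetool clearsnapshot -t daily_backup sensor_keyspace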

Updated answer after additional questions:

Q1: How about keeping the data per day and truncating it monthly, with the following table:

CREATE TABLE SensorData(
  cid text,
  event_date date,
  event_time timestamp,
  data text,
  device_id text,
  device_type text,
  rawdata text,
  PRIMARY KEY ((cid, event_date), event_time, device_id, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC);

Q2: Does it make sense to create the following table for historical processing:



CREATE TABLE SensorData_YYYYMM (
  cid text,
  event_date text,
  event_time timestamp,
  data text,
  device_id text,
  device_type text,
  rawdata text,
  PRIMARY KEY ((cid, event_date), event_time, device_id, device_type)
) WITH CLUSTERING ORDER BY (event_time DESC);

The idea itself is not bad, but I have a couple of concerns. The first is that you put all of a client's data for one day into a single partition. Depending on how many devices a client has, that partition can become too large. Typically in IoT you want to store the data of one sensor in one partition and add a time bucket to the partition key; that also makes it relatively easy to set up the Spark jobs. So the primary key for the first table could be ((cid, device_id, event_date), event_time, device_type).

Second, if you ever receive two messages from the same device within the same millisecond, you would silently lose one of them. I would therefore advise switching event_time to the timeuuid type. Yes, it costs a bit more storage, but it keeps you safe in every scenario where readings might otherwise be lost in the future (when new clients come in, you never know how often their devices will send), including the case where a device batches several messages together to save bandwidth. A sketch combining both suggestions follows.
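
A minimal sketch of the first table with both suggestions applied, i.e. the device moved into the partition key and event_time stored as a timeuuid (the table name is just illustrative):

CREATE TABLE SensorDataByDevice (
    cid text,
    device_id text,
    event_date date,
    event_time timeuuid,   -- timeuuid instead of timestamp, so two readings in the same millisecond don't collide
    device_type text,
    data text,
    rawdata text,
    PRIMARY KEY ((cid, device_id, event_date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

device_type is kept as a regular column here; with a timeuuid the clustering key is already unique per reading, but you can still add it as a clustering column if you want the exact key suggested above.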

One problem you would have with the first table as I described it is that it becomes hard to know every device_id when you walk over the data in your ETL. I would suggest a separate table that lists all device_ids for one client, and every time you provision a sensor for a client you also add an entry to that table. Then, when you do the aggregations in Spark, you can easily combine device_id with cid and event_date to isolate exactly the partitions you need. You should always avoid running Spark over every record in a table, because that is too expensive, and in my opinion this is one way to limit the data traversal. It worked well on one project I was working on. A sketch of such a table follows.
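
The name and exact columns below are assumptions; the point is simply one row per sensor provisioned for a client:

CREATE TABLE DevicesByClient (
    cid text,
    device_id text,
    device_type text,
    PRIMARY KEY (cid, device_id)
);

-- written whenever a new sensor is provisioned (values are placeholders)
INSERT INTO DevicesByClient (cid, device_id, device_type)
VALUES ('client-1', 'sensor-42', 'temperature');

A Spark job can then read this small table first and fan out only over the (cid, device_id, event_date) partitions it actually needs.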

Now to question 2, which also ties back to question 1. Normally I would not recommend storing the raw data a second time; it will not make data management easier. I would recommend using the standard Cassandra TTL mechanism instead: the data simply disappears once the TTL expires. In my experience, raw data is rarely needed for longer than a couple of months.
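
As a sketch, assuming roughly three months of raw-data retention (the exact period is up to you):

-- let new writes to the raw table expire automatically after ~90 days
ALTER TABLE SensorData WITH default_time_to_live = 7776000;   -- 90 * 24 * 3600 seconds

-- or set it per write instead:
-- INSERT INTO SensorData (...) VALUES (...) USING TTL 7776000;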

For example, on one project we used a relational database to store the data after the ETL was done, because queries were much easier there and there was no learning curve for the data analysts. We stored the post-ETL data in a so-called star schema, and it worked really well for us.

Basically I would advise you to think about how to aggregate the data and then create additional tables in Cassandra just for the reports. That way you save yourself a lot of processing time.

Another thing you should consider is sensor latency. Sometimes, due to connectivity issues, sensors are offline for days. So you need some kind of policy for handling out-of-order data in your ETL.

The simplest policy is to ignore out-of-order data. Something in between is a reasonable delay before you start the jobs, i.e. start processing a couple of hours after midnight to make sure the data has arrived, and then run the ETL for the previous day. The hardest option is to update the already aggregated ETL results whenever late data shows up and reprocess, but I would advise against that.

In the end, I don't think an additional monthly table would be useful, because it would contain the same data and the access pattern would not be any different.
