Deduplicating events using Cassandra

I am looking for the best way to deduplicate events using Cassandra.

I have a lot of clients receiving event IDs (thousands per second). I need each event ID to be processed once and only once, with high reliability and high availability.

So far, I have tried two methods:

  • Use the event ID as the partition key and do `INSERT ... IF NOT EXISTS`. If this fails, the event is a duplicate and can be discarded. This is a nice, clean approach, but throughput is low due to Paxos, especially at higher replication factors such as 3. It is also fragile: IF NOT EXISTS always requires a quorum to work, and there is no way to fall back to a lower consistency level when quorum is unavailable. So a couple of down nodes completely blocks processing of any events.

  • Allow clients to race on the same event ID and detect the collision with a clustering column. That is, insert the event ID as the partition key and a client-generated timeuuid as the clustering column. The client then waits a while (in case other clients insert into the same partition) and reads back the event ID with LIMIT 1 to get the oldest clustering row. If the returned timeuuid matches its own, the client is the "winner" and processes the event. If the timeuuid does not match, the event is a duplicate and can be discarded.
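Roughly, the two approaches look like this in CQL (table and column names here are illustrative, not my actual schema):

```sql
-- Approach 1: LWT. [applied] = false in the response means duplicate.
CREATE TABLE events (
    event_id text PRIMARY KEY,
    payload  text
);

INSERT INTO events (event_id, payload)
VALUES ('evt-123', 'some payload')
IF NOT EXISTS;

-- Approach 2: collision detection with a timeuuid clustering column.
CREATE TABLE event_claims (
    event_id  text,
    claim_id  timeuuid,
    client_id text,
    PRIMARY KEY (event_id, claim_id)
) WITH CLUSTERING ORDER BY (claim_id ASC);

-- Each client inserts its own claim...
INSERT INTO event_claims (event_id, claim_id, client_id)
VALUES ('evt-123', now(), 'client-7');

-- ...waits a grace period, then reads back the oldest claim.
-- If claim_id is the one this client wrote, it "won" the event.
SELECT claim_id, client_id
FROM event_claims
WHERE event_id = 'evt-123'
LIMIT 1;
```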

The collision method (a bakery-style algorithm) has much better throughput and availability than IF NOT EXISTS, but it is more complex and riskier. For example, if a client's system clock drifts, a duplicate event can appear not to be a duplicate. All my client and Cassandra nodes run NTP, but that is not always ideal for clock synchronization.

Anyone have a suggestion for which approach to use? Is there any other way to do this?

Also note that my cluster will be configured with three datacenters with about 100ms latency between DCs.

Thanks.



3 answers


IF NOT EXISTS doesn't scale, and it also strains Cassandra (because coordination is slow, as you know), but it is probably the "official, correct" way to do it. There are two other methods that "work":

1) Use an external locking system (ZooKeeper, memcached CAS, etc.) so that coordination is handled OUTSIDE Cassandra.



2) Use an ugly hack with an inverted timestamp so the first write wins. Instead of using a client timestamp that matches the actual wall-clock time, use MAX_LONG - (wall time) as the timestamp. That way the first write has the highest "timestamp" and takes precedence over subsequent writes. This method works, though it plays havoc with things like DTCS (if you are doing time series and want to use DTCS, don't use this method; DTCS will get terribly confused) and with deletes in general (if you ever want to REALLY delete a row, you will also have to write the tombstone with an artificial timestamp).
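A sketch of the inverted-timestamp trick (the value must be computed client-side, since CQL does not evaluate arithmetic inside USING TIMESTAMP; table and column names are illustrative):

```sql
-- Client computes: ts = 9223372036854775807 (MAX_LONG) - now_in_micros,
-- so earlier writes carry HIGHER timestamps and win last-write-wins.
INSERT INTO events (event_id, payload)
VALUES ('evt-123', 'first writer wins')
USING TIMESTAMP 9221915244354775807;  -- example precomputed value

-- To really delete later, the tombstone needs an even higher timestamp:
DELETE FROM events
USING TIMESTAMP 9223372036854775807
WHERE event_id = 'evt-123';
```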

It's worth noting that there have been attempts to address the last-write-wins nature of Cassandra, for example CASSANDRA-6412 (which I was working on at one point, and will likely pick up again in the next month or so).



Maybe a digression here, but have you tried Redis distributed locks (http://redis.io/topics/distlock), sharded by event_id, with Twemproxy as a proxy in front of Redis if your load is too high?
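For a single Redis node, the acquire step boils down to one command (the key naming, token, and TTL here are illustrative):

```
SET lock:evt-123 abc123 NX PX 30000
```

A reply of `OK` means the lock was acquired and this client processes the event; a nil reply means another client already claimed it. To release safely, delete the key only if it still holds your token; the distlock page linked above shows a small Lua script for that check-and-delete.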





I think your second option is the best of all the suggested solutions. But instead of keeping only the oldest value in the clustering column, I would store all events, keeping them ordered from oldest to newest (on insert you then don't have to check whether the row already exists, whether it is the oldest, etc.; you can later pick the oldest one by its writetime attribute). Then I would select the oldest one for processing, as you wrote. Since Cassandra doesn't distinguish an insert from an upsert, I don't see any alternative way to do this within Cassandra itself, or, as someone said, do it from the outside.
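If the ordering is done by write time rather than by a timeuuid, the pick-the-oldest step might look like this (a sketch with a hypothetical table name; note writetime() works only on regular columns, not primary-key columns, so it is read from a regular column here):

```sql
SELECT client_id, writetime(client_id) AS wt
FROM event_claims
WHERE event_id = 'evt-123';
-- the client scans the rows and treats the minimum wt as the winner
```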







