Duplicate entries when Spark tasks fail

I am running into duplicate entries in a Cassandra table when Spark tasks fail and are restarted. This is the schema of the table I am inserting into:

CREATE TABLE duplicate_record (
    object_id bigint,
    entity_key timeuuid,
    PRIMARY KEY (object_id, entity_key)
);


Example of duplicate records in the table:

object_id    entity_key
1181592431   uuid
1181592431   uuid1
8082869622   uuid2
8082869622   uuid3


I have a DataFrame created by a left join between Oracle and Cassandra, so it contains records that already exist in Cassandra as well as new records coming from Oracle. I map over each entry to check whether an entity_key already exists: if it does, I reuse it; otherwise I generate a new entity_key for the new entry. I then use saveToCassandra to insert this DataFrame into Cassandra. Roughly, the flow looks like the sketch below.
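A minimal sketch of that flow, assuming the spark-cassandra-connector with the DataStax Java driver's UUIDs helper; the keyspace name, DataFrame names, and the writeRecords wrapper are placeholders, not my actual code:

import com.datastax.spark.connector._
import com.datastax.driver.core.utils.UUIDs
import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder case class mirroring the duplicate_record table.
case class Record(object_id: Long, entity_key: String)

def writeRecords(spark: SparkSession,
                 oracleDf: DataFrame,
                 cassandraDf: DataFrame): Unit = {
  import spark.implicits._

  // Left join: every Oracle row is kept, and an existing entity_key is
  // attached when the object_id is already present in Cassandra.
  val joined = oracleDf.join(
    cassandraDf.select($"object_id", $"entity_key"),
    Seq("object_id"), "left")

  val toWrite = joined.map { row =>
    val existingKey = Option(row.getAs[String]("entity_key"))
    // Reuse the existing key, otherwise mint a new timeuuid.
    // UUIDs.timeBased() is non-deterministic, which matters on task retry.
    Record(row.getAs[Long]("object_id"),
           existingKey.getOrElse(UUIDs.timeBased().toString))
  }

  // saveToCassandra operates on RDDs, so convert the Dataset first.
  toWrite.rdd.saveToCassandra("my_keyspace", "duplicate_record",
                              SomeColumns("object_id", "entity_key"))
}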

When a task fails and is restarted, records that were already inserted are reinserted with a different entity_key. I suspect that the records written during the first (partially successful) attempt are not visible to the resubmitted task, so it regenerates the keys and produces duplicate records.
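To illustrate the failure mode I suspect (a purely illustrative sketch, not code from the job): because the timeuuid is minted at execution time, a retried task cannot reproduce the key the first attempt wrote, and since entity_key is part of the primary key the re-insert lands as a new row instead of an upsert.

import com.datastax.driver.core.utils.UUIDs

// Two "attempts" at processing the same Oracle row.
val firstAttemptKey = UUIDs.timeBased()
val retryAttemptKey = UUIDs.timeBased()

// The keys differ, so the two inserts become two rows for one object_id:
//   INSERT ... (1181592431, firstAttemptKey)
//   INSERT ... (1181592431, retryAttemptKey)   <- duplicate object_id
println(firstAttemptKey == retryAttemptKey)     // false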
