Duplicate entries when Spark tasks fail
I am running into duplicate entries in a Cassandra table when Spark tasks fail and are restarted. Here is the schema of the table I am inserting into:
CREATE TABLE duplicate_record (
    object_id bigint,
    entity_key timeuuid,
    PRIMARY KEY (object_id, entity_key)
);
Example of duplicate records in the table:
1181592431 uuid
1181592431 uuid1
8082869622 uuid2
8082869622 uuid3
I have a DataFrame that is created by a left join between Oracle and Cassandra, so it contains both records that already exist in Cassandra and new records coming from Oracle. I map over each row to check whether an entity_key already exists: if it does, I reuse it; if it does not, I generate a new entity_key for the new entry. I then use saveToCassandra to write this DataFrame into Cassandra. The flow is roughly like the sketch below.
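A minimal sketch of that flow, not my actual code: the keyspace "ks", the Oracle connection details, and the column handling are assumptions. It uses the Spark Cassandra Connector's DataFrame reader plus the RDD saveToCassandra call.

// sketch only; names marked "assumed" are placeholders
import com.datastax.spark.connector._
import com.datastax.driver.core.utils.UUIDs
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("entity-key-example").getOrCreate()
import spark.implicits._

// object_ids arriving from Oracle (both already-known and brand-new ones)
val oracleDf = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // assumed connection string
  .option("dbtable", "SOURCE_TABLE")                         // assumed source table
  .load()
  .select($"object_id".cast("long").as("object_id"))

// rows that already exist in Cassandra
val cassandraDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "duplicate_record"))
  .load()

// left join: entity_key is null for object_ids Cassandra has not seen yet
val joined = oracleDf.join(cassandraDf, Seq("object_id"), "left")

// reuse the existing entity_key, or mint a new timeuuid for new rows;
// UUIDs.timeBased() depends on the call time, so a re-executed task will
// generate a different key for the same object_id
val withKeys = joined.rdd.map { row =>
  val objectId  = row.getAs[Long]("object_id")
  val entityKey = Option(row.getAs[String]("entity_key"))
                    .getOrElse(UUIDs.timeBased().toString)
  (objectId, entityKey)
}

withKeys.saveToCassandra("ks", "duplicate_record",
  SomeColumns("object_id", "entity_key"))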
When a task fails and is restarted, records that were already inserted are reinserted with a different entity_key. I suspect that the records inserted during the successful attempt are not visible to the resubmitted task, so it generates fresh keys and produces duplicate rows.
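The key generation itself is not repeatable, so any re-execution of the map necessarily mints a different key. A tiny sketch of this, assuming UUIDs.timeBased() from the DataStax Java driver is what creates the timeuuid:

import com.datastax.driver.core.utils.UUIDs

// two calls never return the same timeuuid, since the value encodes the call time;
// a resubmitted task that re-runs the map therefore writes the row under a new key,
// and because entity_key is part of the primary key this is a new row, not an upsert
val first  = UUIDs.timeBased()
val second = UUIDs.timeBased()
assert(first != second)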