Inconsistent read results from Cassandra after batch writes
I am using Apache Spark with the Spark Cassandra Connector to write millions of rows to a Cassandra cluster. The replication factor is 3, and I set the write consistency level to ALL via spark-submit (YARN client mode) with the following parameters:
spark-submit ...
--conf spark.cassandra.output.consistency.level=ALL \
--conf spark.cassandra.output.concurrent.writes=1 \
--conf spark.cassandra.output.batch.size.bytes=20000 \
...
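The write job itself is essentially just a saveToCassandra call on an RDD; a rough sketch is below (the keyspace, table, and column names are placeholders, not my real schema):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object WriteJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("write-job"))

    // In reality this RDD holds millions of rows; a dummy range keeps the sketch self-contained.
    val rows = sc.parallelize(1 to 1000000).map(i => (i, s"value-$i"))

    // Consistency level, concurrent writes and batch size come from the
    // spark.cassandra.output.* settings passed to spark-submit above.
    rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))
  }
}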
Then I wrote another Spark job to count the rows I had written, configured as follows:
spark-submit ...
--conf spark.cassandra.input.consistency.level=ONE \
--conf spark.cassandra.input.split.size=50000 \
...
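The counting job is just as simple, roughly like this (again with placeholder keyspace/table names):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CountJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-job"))

    // Read consistency level and split size come from the
    // spark.cassandra.input.* settings passed to spark-submit above.
    val count = sc.cassandraTable("my_keyspace", "my_table").count()
    println(s"row count = $count")
  }
}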
According to the documentation, if write consistency + read consistency > replication factor, reads should be strongly consistent. That holds here: ALL (3 replicas) + ONE (1 replica) = 4 > 3.
But I am getting the following results:
- The read job gives me a different count every time I run it
- If I increase the consistency level of the read job, I get the expected count
What am I missing? Is there some hidden default behaviour (for example, silently downgrading the consistency level if a write runs into problems), am I hitting a bug in this version of Cassandra (2.1.2), or is there a problem with the batch updates that the spark-cassandra-connector uses to save data to Cassandra (I simply call the saveToCassandra method)?
What is going wrong?