Cassandra consistency problems with batch updates

I am using Apache Spark with the Spark Cassandra connector to write millions of rows to a Cassandra cluster. The replication factor is set to 3, and I set the write consistency level to ALL when submitting the job (spark-submit in YARN client mode) with the following parameters:

spark-submit ...
--conf spark.cassandra.output.consistency.level=ALL \
--conf spark.cassandra.output.concurrent.writes=1 \
--conf spark.cassandra.output.batch.size.bytes=20000 \
...
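
For illustration, the write side boils down to something like this (the keyspace, table and record class are placeholders, and WriteConf field names may differ slightly between connector versions, so treat it as a sketch):

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf
import com.datastax.driver.core.ConsistencyLevel
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record matching a table my_ks.my_table(id bigint, value text)
case class Record(id: Long, value: String)

object WriteJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cassandra-write"))

    val rows = sc.parallelize(1L to 1000000L).map(i => Record(i, s"value-$i"))

    // Same effect as --conf spark.cassandra.output.consistency.level=ALL,
    // but set programmatically for this one write via WriteConf.
    rows.saveToCassandra("my_ks", "my_table",
      writeConf = WriteConf(consistencyLevel = ConsistencyLevel.ALL))

    sc.stop()
  }
}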

Then I wrote another Spark job to count the data I had written. I submitted the new job as follows:

spark-submit ...
--conf spark.cassandra.input.consistency.level=ONE \
--conf spark.cassandra.input.split.size=50000 \
...
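
The counting job itself is essentially just this (again with placeholder names; the read consistency and split size come from the same properties passed on the command line):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CountJob {
  def main(args: Array[String]): Unit = {
    // Same keys as the --conf flags above, shown here only to keep the sketch self-contained
    val conf = new SparkConf()
      .setAppName("cassandra-count")
      .set("spark.cassandra.input.consistency.level", "ONE")
      .set("spark.cassandra.input.split.size", "50000")
    val sc = new SparkContext(conf)

    // Full scan of the table, split into Spark partitions, then a count
    val total = sc.cassandraTable("my_ks", "my_table").count()
    println(s"Row count: $total")

    sc.stop()
  }
}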

According to the documentation, if the write consistency plus the read consistency is greater than the replication factor, then reads should be consistent. In my case that holds: the replication factor is 3, write consistency ALL means 3 replicas, and read consistency ONE means 1 replica, so 3 + 1 > 3.

But I am getting the following results:

  • The counting job gives me a different result (count) every time I run it
  • If I increase the consistency level of the read job, I get the expected count

What am I missing? Is there some hidden configuration applied by default (for example, lowering the consistency level when a write runs into problems, or something like that), am I running a buggy version of Cassandra (it's 2.1.2), or is there a problem with the batch updates that the spark-cassandra-connector uses to save data to Cassandra (I just use the saveToCassandra method)?

What could be going wrong?


1 answer


I can confirm that this is a bug in the connector. The consistency level is set on the individual prepared statements, but it is simply ignored when they are wrapped in batch statements. Follow the connector's updates; the fix will be included in the next bugfix release.
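
To illustrate the distinction at the driver level, here is a sketch using the plain DataStax Java driver (not the connector's internal code, and with placeholder keyspace/table names): the level that actually applies to a batched write is the one on the BatchStatement itself, not the levels set on the statements inside it.

import com.datastax.driver.core.{BatchStatement, Cluster, ConsistencyLevel}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_ks")   // placeholder keyspace

val insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")

val batch = new BatchStatement()
batch.add(insert.bind(java.lang.Long.valueOf(1L), "a"))
batch.add(insert.bind(java.lang.Long.valueOf(2L), "b"))

// The consistency level used for the write is the one set on the batch;
// levels set only on the inner bound statements are ignored.
batch.setConsistencyLevel(ConsistencyLevel.ALL)

session.execute(batch)
cluster.close()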


