Spark DataStax Cassandra connector slow to read from a large Cassandra table

I am new to Spark and the Spark Cassandra Connector. Our team is trying Spark for the first time, and we are using the Spark Cassandra Connector to connect to our Cassandra database.

I wrote a query against a large table, and I saw that the Spark tasks would not start until the query had fetched all records from the table.

It takes more than 3 hours to get all records from the database.

To get data from the database, we use:

  CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
      .cassandraTable(keyspaceName, tableName);

Is there a way to tell Spark to start working even if all the data hasn't finished loading?

Is there a way to tell spark-cassandra-connector to use more threads for sampling?

thanks, Coca.





1 answer


If you look at the Spark UI, how many partitions does the table scan create? I recently did something similar and found that Spark was creating too many partitions for the scan, which made it take much longer. The way I cut down the time in my job was to set the config parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case, that brought a roughly 20-minute job down to about four minutes. There are a few more Cassandra-specific Spark read configuration variables you can set as well. These Stack Overflow questions are what I referenced originally; I hope they help you (there is also a small configuration sketch after the links below).

Iterate a large Cassandra table in small chunks

Set the number of tasks when scanning a Cassandra table
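
To illustrate the split-size approach, here is a minimal Java sketch. The host, keyspace, table, and the 128 MB value are placeholders (not from the original question), and the partition count printout is just a way to compare the effect of the setting alongside the Spark UI:

    import com.datastax.spark.connector.japi.CassandraJavaUtil;
    import com.datastax.spark.connector.japi.CassandraRow;
    import com.datastax.spark.connector.japi.rdd.CassandraTableScanJavaRDD;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SplitSizeExample {
        public static void main(String[] args) {
            // Raise the split size so the connector creates fewer, larger partitions.
            // 128 is only an example value; tune it against your own table.
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-read")
                    .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
                    .set("spark.cassandra.input.split.size_in_mb", "128");

            JavaSparkContext jsc = new JavaSparkContext(conf);

            // "my_keyspace" / "my_table" stand in for your real keyspace and table.
            CassandraTableScanJavaRDD<CassandraRow> rdd =
                    CassandraJavaUtil.javaFunctions(jsc).cassandraTable("my_keyspace", "my_table");

            // Number of Spark partitions the table scan will create; compare this
            // before and after changing the split size (also visible in the Spark UI).
            System.out.println("Partitions: " + rdd.partitions().size());

            jsc.stop();
        }
    }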



EDIT:

After running some performance tests around Spark's configuration options, I found that Spark was creating too many table partitions when I didn't give the Spark executors enough memory. In my case, increasing executor memory by a gigabyte was enough to make the input split size parameter unnecessary. If you can't give the executors more memory, you may need to set spark.cassandra.input.split.size_in_mb higher as a workaround.
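
If you try the executor-memory route instead, it is a plain Spark setting rather than anything connector-specific. A minimal sketch, where the "4g" value and app name are placeholders you would size for your own cluster (equivalent to passing --executor-memory to spark-submit):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExecutorMemoryExample {
        public static void main(String[] args) {
            // "4g" is a placeholder; give executors enough memory that the
            // connector's default partitioning stays manageable.
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-read")
                    .set("spark.executor.memory", "4g");

            JavaSparkContext jsc = new JavaSparkContext(conf);
            // ... build the Cassandra table RDD as in the question ...
            jsc.stop();
        }
    }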









