Slow writes from Spark to Cassandra

I am building a small Spark application with the Spark Cassandra Connector and DataFrames in Python, but I am getting extremely slow write speeds. The application log says:

17/03/28 20:04:05 INFO TableWriter: Wrote 315514 rows to movies.moviescores in 662.134 s.    


That is only about 476 rows per second.

I read some data from Cassandra into a DataFrame, do some operations on it (which also make the data set much larger), and then write the result back to Cassandra (about 50 million rows):

result.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="moviescores", keyspace="movies").save()


where result is a DataFrame.
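For context, the whole pipeline looks roughly like the sketch below. Only the write call above is my actual code; the source table name ratings, the contact point, and the placeholder transformation are just illustrative.

# Rough sketch of the pipeline; the source table "ratings", the contact point
# and the placeholder transformation are illustrative, not my real code.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("moviescores")
         .config("spark.cassandra.connection.host", "10.0.0.1")  # assumed contact point
         .getOrCreate())

# Read the source data from Cassandra through the connector
ratings = (spark.read
           .format("org.apache.spark.sql.cassandra")
           .options(table="ratings", keyspace="movies")
           .load())

# ... transformations that grow the data set to ~50 million rows ...
result = ratings  # placeholder for the real transformation

# The slow write (same call as above)
(result.write
 .format("org.apache.spark.sql.cassandra")
 .mode("append")
 .options(table="moviescores", keyspace="movies")
 .save())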

Here is my keyspace creation, in case it matters:

CREATE KEYSPACE IF NOT EXISTS movies WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };


And the table I am writing to:

CREATE TABLE IF NOT EXISTS movieScores(movieId1 int, movieId2 int, score int, PRIMARY KEY((movieId1, movieId2)));


My setup looks like this: 5 Spark workers running in Docker containers, each on a different DigitalOcean droplet running CoreOS with 2GB of RAM and two cores, and 3 Cassandra nodes running in Docker containers, each on a different DigitalOcean droplet running CoreOS with 2GB of RAM and two cores.

The nodes running Spark have 2GB of RAM, but a worker can only use up to 1GB of it, because that is the default for Spark standalone mode:

(default: your machine total RAM minus 1 GB)


Not sure if it's wise to raise it.
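If I did raise it, my understanding is that it would mean raising SPARK_WORKER_MEMORY in conf/spark-env.sh on each worker and then asking for more executor memory from the application, roughly like this (the values here are guesses for a 2GB machine, not something I have tested):

# Sketch only: requesting more memory per executor. The worker itself is
# capped by SPARK_WORKER_MEMORY (set in conf/spark-env.sh on each worker),
# so spark.executor.memory must stay below that cap. Values are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("moviescores")
         .config("spark.executor.memory", "1536m")               # assumed value
         .config("spark.cassandra.connection.host", "10.0.0.1")  # assumed contact point
         .getOrCreate())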

I have also read that you should run a Spark worker alongside a Cassandra node on every machine in the cluster (for data locality). But I'm not sure it's a good idea to run a Docker container with Spark and another container with a Cassandra node on a 2GB machine with two cores.

Why are the writes so slow? Are there parameters or settings I have to change to increase the write speed? Or is my setup simply wrong? I am completely new to Spark and Cassandra.

Update: I just ran a test against one table without Spark, using only the Cassandra Python driver and a small Python program on my laptop. With batch inserts of 1000 rows I could insert 1 million rows in just 35 seconds, which is almost 30,000 rows per second, much faster. So maybe Spark is the problem, not Cassandra. Would it make sense to post the rest of my code here? Or is there something wrong with my setup?
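The laptop test was along these lines (a rough sketch; the contact point and the dummy values are illustrative rather than the exact code I ran):

# Rough sketch of the laptop benchmark: 1000-row batches through the
# DataStax Python driver. Contact point and dummy values are illustrative.
import time
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])   # assumed contact point
session = cluster.connect("movies")

insert = session.prepare(
    "INSERT INTO moviescores (movieid1, movieid2, score) VALUES (?, ?, ?)")

total_rows = 1000000
batch_size = 1000
start = time.time()
for i in range(0, total_rows, batch_size):
    batch = BatchStatement()
    for j in range(i, i + batch_size):
        batch.add(insert, (j, j + 1, j % 5))   # dummy values
    session.execute(batch)
print("%d rows in %.1f seconds" % (total_rows, time.time() - start))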

1 answer


I recently ran into similar problems when saving over 80 million records to Cassandra. In my case I was using the Spark Java API. What helped solve my problem was applying orderBy() to the Dataset before saving it to Cassandra via the spark-cassandra-connector. Try ordering your dataset first and then calling save() to write it to Cassandra.
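In the PySpark DataFrame API used in the question, that suggestion would look roughly like this (assuming the DataFrame columns are named movieid1 and movieid2):

# Sketch of the suggestion above in PySpark: sort on the partition key
# columns before writing, so rows for the same partition are written together.
# result is the DataFrame from the question; column names are assumed.
(result
 .orderBy("movieid1", "movieid2")
 .write
 .format("org.apache.spark.sql.cassandra")
 .mode("append")
 .options(table="moviescores", keyspace="movies")
 .save())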


