Spark cluster performance decreases when adding more nodes
I have a large dataset of 1B records and want to run analytics on it with Apache Spark because of the scaling it provides, but I see an anti-pattern here. The more nodes I add to the Spark cluster, the longer the job takes to complete. The data store is Cassandra and the queries are run from Zeppelin. I've tried many different queries, but even a simple dataframe.count() behaves like this.
Here is the Zeppelin notebook; the temp table has 18M records:
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
  .load().cache()
df.registerTempTable("table")
%sql
SELECT first(devid),date,count(1) FROM table group by date,rtu order by date
When testing with different numbers of Spark worker nodes, these were the results:
+-------------+---------------+
| Spark Nodes | Time |
+-------------+---------------+
| 1 node | 17 min 59 sec |
| 2 nodes | 12 min 51 sec |
| 3 nodes | 15 min 49 sec |
| 4 nodes | 22 min 58 sec |
+-------------+---------------+
Increasing the number of nodes reduces performance, which shouldn't happen, as it defeats the purpose of using Spark.
If you would like me to run any query, or need additional information about the setup, please ask. Any hints as to why this is happening would be very helpful; I have been stuck on this for two days. Thank you for your time.
Versions
Zeppelin: 0.7.1, Spark: 2.1.0, Cassandra: 2.2.9, Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
Spark cluster specs
6 vCPUs, 32 GB memory = 1 node
Cassandra + Zeppelin server specs
8 vCPU, 52 GB memory
One thing to consider is that at some point you may be overwhelming Cassandra with requests. Without scaling the Cassandra side of the equation, you can easily see diminishing returns as C* ends up spending too much time rejecting requests.
This is basically the Mythical Man-Month fallacy. Just because you can throw more workers at a problem does not necessarily mean that the job will complete faster.
It would be very helpful for you to benchmark the different parts of your query separately. As you currently have it, the entire dataset is read into the cache, which adds extra slowness if you are benchmarking a single request.
You should test in isolation:
- Reading from C* without caching (just counting directly from C*)
- Caching cost (count after caching)
- Cost of the full shuffle query (run against already-cached data)
Then you can figure out where your bottlenecks are and scale accordingly.
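The three isolated measurements could be sketched roughly as follows for a Zeppelin %spark paragraph. This is only an illustration: it assumes Zeppelin's predefined sqlContext, the keyspace and table names from the question, and two made-up names (the timed helper and the temp_cached view).

```scala
// Sketch for a Zeppelin %spark paragraph; `sqlContext` is assumed to be
// the instance Zeppelin injects into the interpreter.
import org.apache.spark.sql.DataFrame

// Hypothetical helper: evaluates the block once and prints wall-clock time.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label%-32s $seconds%.1f s")
  result
}

// Builds a fresh, uncached DataFrame over the Cassandra table from the question.
def loadTemp(): DataFrame = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
  .load()

// 1. Read from C* without caching.
timed("read from C* (no cache)") { loadTemp().count() }

// 2. Caching cost: the first count has to populate the cache,
//    the second one is served from it.
val cached = loadTemp().cache()
timed("count that fills the cache") { cached.count() }
timed("count on cached data") { cached.count() }

// 3. Full shuffle query, run against the already-cached data.
cached.createOrReplaceTempView("temp_cached") // hypothetical view name
timed("group by on cached data") {
  sqlContext.sql(
    "SELECT first(devid), date, count(1) FROM temp_cached " +
    "GROUP BY date, rtu ORDER BY date").collect()
}
```

Repeating the same paragraph on 1, 2, 3 and 4 worker nodes should show which of the three steps stops scaling. If it is the first one, the bottleneck is more likely on the Cassandra side than in Spark.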