Spark task schedule
I am running a fairly large job on my 4-node cluster. I read about 4 GB of filtered data from a single table and run Naïve Bayes training and prediction on it. I have an HBase region server running on a single machine, separate from the Spark cluster, which runs in fair scheduling mode; HDFS runs on all machines.
At runtime I see a strange distribution of tasks across the cluster: at any given time only one active task, or at most two, is running on one or two machines, while the other machines sit idle. My assumption was that the data in the RDD would be split and processed on all nodes for operations like count and distinct. Why aren't all nodes being used for the large tasks of a single job? Does having HBase on a single machine have anything to do with this?
Some things to check:
- Presumably you are reading your data with hadoopFile() or hadoopRDD(): consider setting the optional minPartitions parameter so that the number of partitions is at least the number of nodes you want to use.
- As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is spread across them. (Sometimes an operation creates an RDD with the same number of partitions but leaves the data in it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (which is the number of partitions), and then looping through it and printing the number of elements in each of the arrays. (This introduces extra communication, so don't leave it in your production code.)
- Many RDD API calls have optional parameters for setting the number of partitions, and there are calls like repartition() and coalesce() that change the partitioning. Use them to fix the problems you find with the technique above (although sometimes it will reveal the need to rethink your algorithm).
- Make sure you are actually using RDDs for all of your large data and haven't accidentally ended up with some big data structure on the driver.
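The glom() check from the list above could look roughly like the following PySpark sketch. The helper names partition_sizes, skew_report and demo are made up for illustration, and the local[4] master and toy data in demo() are assumptions, not the asker's actual setup.

```python
def partition_sizes(rdd):
    """Return a list with the number of elements in each partition of rdd."""
    # glom() turns every partition into a single list; len() runs on the
    # executors, so only one integer per partition is sent back to the driver.
    return rdd.glom().map(len).collect()


def skew_report(sizes):
    """Summarize per-partition element counts to make skew easy to spot."""
    total = sum(sizes)
    return {
        "partitions": len(sizes),
        "non_empty": sum(1 for s in sizes if s > 0),
        "total": total,
        "max": max(sizes, default=0),
        "mean": total / len(sizes) if sizes else 0.0,
    }


def demo():
    """Toy run; needs pyspark installed. Debugging aid only."""
    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-check")  # assumed local master
    try:
        # A filter keeps the partition count but can leave partitions empty.
        rdd = sc.parallelize(range(1000), 8).filter(lambda x: x < 100)
        print("after filter:     ", skew_report(partition_sizes(rdd)))

        # repartition() shuffles the data into the requested number of partitions.
        balanced = rdd.repartition(4)
        print("after repartition:", skew_report(partition_sizes(balanced)))
    finally:
        sc.stop()
```

Calling demo() on a machine with pyspark prints one summary before and one after repartition(). A "max" far above "mean", or a "non_empty" count below the number of nodes, suggests that only a few tasks will have real work to do.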
All of this assumes that you have a data skew problem rather than something more sinister. That is not guaranteed to be the case, but you need to check your data skew situation before looking for something more complicated. Data skew creeps in easily, especially given Spark's flexibility, and it can make a real mess.