Spark task scheduling

I am running a fairly large job on my 4-node cluster. I read about 4 GB of filtered data from one table and do Naïve Bayes training and prediction. The HBase region server runs on a single machine, separate from the Spark cluster, which runs in fair scheduling mode, although HDFS runs on all machines.

At runtime, I'm seeing a strange distribution of active tasks across the cluster. Only one or two tasks are running on one or two machines at any given time, while the other machines sit idle. My assumption was that the data in the RDD would be split and processed on all nodes for operations like count and distinct. Why aren't all nodes being used for the large tasks in a single job? Does running HBase on a single machine have anything to do with this?


1 answer

Some things to check:

  • Presumably you are reading your data with hadoopFile() or hadoopRDD(): consider setting the optional minPartitions parameter to make sure the number of partitions matches the number of nodes you want to use.
  • When you create other RDDs in your application, check the number of partitions of those RDDs and whether the data is spread evenly across them. (Sometimes an operation can create an RDD with the same number of partitions yet leave the data in it badly unbalanced.) You can check this by calling glom(), printing the number of elements of the resulting RDD (the number of partitions), and then looping through it and printing the number of elements in each of the arrays. (This adds overhead, so don't leave it in your production code.)
  • Many RDD API calls have optional parameters for setting the number of partitions, and there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix any problems you find using the technique above (though sometimes it will reveal the need to rethink your algorithm).
  • Make sure you are actually using RDDs for all of your big data, and haven't accidentally ended up with some big data structure on the driver.
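The partition-count and skew checks above can be sketched as follows. This is a minimal illustration, not your code: the HDFS path, the SparkContext name sc, and the partition count 16 are placeholders for whatever your cluster and job actually use.

```scala
import org.apache.spark.SparkContext

def printPartitionSizes(sc: SparkContext): Unit = {
  // Ask for at least as many partitions as you have cores/nodes;
  // 16 is a placeholder for your own cluster's parallelism.
  val data = sc.textFile("hdfs:///path/to/input", minPartitions = 16)

  // glom() turns each partition into an array, so collecting the array
  // lengths shows how evenly records are spread across partitions.
  val sizes = data.glom().map(_.length).collect()

  println(s"number of partitions: ${sizes.length}")
  sizes.zipWithIndex.foreach { case (n, i) =>
    println(s"partition $i: $n elements")
  }
}
```

If one partition holds most of the records while the rest are nearly empty, that skew, not the cluster, explains why only one or two machines have active tasks.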
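And a hedged sketch of fixing an imbalance once you find one; rdd here stands for whichever RDD the diagnostics showed to be skewed, and RDD[String] is just an illustrative element type:

```scala
import org.apache.spark.rdd.RDD

def rebalance(rdd: RDD[String], numPartitions: Int): RDD[String] = {
  // repartition() performs a full shuffle and spreads records evenly
  // across numPartitions partitions. Use it when data is skewed.
  rdd.repartition(numPartitions)
}

def shrink(rdd: RDD[String], numPartitions: Int): RDD[String] = {
  // coalesce() can reduce the partition count without a full shuffle:
  // cheaper than repartition(), but it only merges existing partitions,
  // so it will not fix skew within them.
  rdd.coalesce(numPartitions)
}
```

The shuffle that repartition() triggers is not free, so it is worth confirming the skew first rather than sprinkling repartition calls everywhere.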

All of this assumes that you have a data skew problem rather than something more sinister. There is no guarantee that is the case, but you should check your data situation before looking for anything more complex. Skew is easy to introduce, especially given Spark's flexibility, and it can make a real mess.

