Spark task schedule

Question

I am running a fairly large job on my 4-node cluster. I read about 4 GB of filtered data from a single table and run Naïve Bayes training and prediction on it. I have an HBase region server running on a single machine, separate from the Spark cluster, which runs in fair scheduling mode; HDFS runs on all machines.

At runtime, I'm seeing an odd distribution of active tasks across the cluster. Only one task, or at most two, is running on one or two machines at any given time, while the other machines sit idle. I assumed that the data in the RDD would be split and processed on all nodes for operations like count and distinct. Why aren't all nodes being used for the large tasks of a single job? Does having HBase on a single machine have anything to do with this?

+3

mapreduce hadoop yarn apache-spark hadoop2

Tinku 29 Sep 14 at 12:35

1 answer

Spiro Michaylov · Accepted Answer · 2014-09-30T02:11:00+0000

Some things to check:

1. Presumably, you are reading your data with hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is at least the number of nodes you want to use.

2. As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is spread across them. (Sometimes an operation creates an RDD with the same number of partitions but leaves the data in it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions), and then looping through it and printing the number of elements in each of the arrays. (This introduces communication, so don't leave it in your production code.)

3. Many of the API calls on RDD have optional parameters for setting the number of partitions, and there are calls like repartition() and coalesce() that change the partitioning. Use them to fix problems you find with the technique above (although sometimes it will reveal that you need to rethink your algorithm).

4. Make sure you are actually using RDDs for all of your large data, and haven't accidentally ended up with some big data structure on the master.
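The glom() check and the repartition() fix described above could be sketched roughly as follows in Python. This is only an illustration, not code from the original answer: the helper names (partition_sizes, skew_report, rebalance_if_skewed) and the 2.0 threshold are hypothetical, and the rdd argument is assumed to be a PySpark RDD, on which glom(), map(), collect() and repartition() are available.

```python
from statistics import mean


def partition_sizes(rdd):
    """Count the elements in each partition of an RDD (debugging only).

    glom() turns every partition into a single list, so mapping len over
    the result yields one count per partition. This runs a full job.
    """
    return rdd.glom().map(len).collect()


def skew_report(sizes):
    """Summarise how evenly elements are spread across partitions.

    `sizes` is a plain list of per-partition counts. A max_to_mean ratio
    of 1.0 means a perfectly even spread; larger values mean more skew.
    """
    avg = mean(sizes) if sizes else 0
    return {
        "partitions": len(sizes),
        "empty": sum(1 for s in sizes if s == 0),
        "largest": max(sizes, default=0),
        "max_to_mean": (max(sizes) / avg) if avg else 0.0,
    }


def rebalance_if_skewed(rdd, target_partitions, threshold=2.0):
    """Repartition when there are too few partitions or the data is skewed.

    The threshold of 2.0 is an arbitrary illustrative default (assumption).
    """
    report = skew_report(partition_sizes(rdd))
    too_few = report["partitions"] < target_partitions
    if too_few or report["max_to_mean"] > threshold:
        return rdd.repartition(target_partitions)
    return rdd
```

With a live SparkContext, calling skew_report(partition_sizes(some_rdd)) after each major transformation should show whether one or two partitions hold most of the data, which would match the symptom of only one or two tasks being active at a time. Keeping skew_report separate from the Spark call means the summary logic can be tried on plain lists of counts.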

All of this assumes that you have a data skew problem rather than something more sinister. That is not guaranteed to be the case, but you need to check your data skew situation before looking for something more complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
