How can I improve the runtime of the last tasks in a standalone cluster?
I have this code:
JavaRDD<Document> termDocsRdd = sc.wholeTextFiles("D:/tmp11", 20).flatMap(
    new FlatMapFunction<Tuple2<String, String>, Document>() {
        @Override
        public Iterable<Document> call(Tuple2<String, String> tup) {
            return Arrays.asList(DocParse.parse(parsingFunction(tup)));
        }
    }
);
Here I am reading text files from local storage (not a distributed filesystem) and normalizing them (each file is ~100 KB - 1.5 MB). parsingFunction contains no Spark operations such as map or flatMap, and no data-distribution functions.
When I run the application on a standalone cluster, all processors on the worker machine are initially fully loaded (100%), and I see in the console:
14/12/09 18:30:41 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, fujitsu11., PROCESS_LOCAL, 1377 bytes)
14/12/09 18:30:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 12873 ms on fujitsu11. (1/12)
14/12/09 18:30:42 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, fujitsu11., PROCESS_LOCAL, 1327 bytes)
14/12/09 18:30:42 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 14001 ms on fujitsu11. (2/12)
14/12/09 18:30:44 INFO scheduler.TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, fujitsu11., PROCESS_LOCAL, 1327 bytes)
14/12/09 18:30:44 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 15925 ms on fujitsu11. (3/12)
...
Later I see that the last tasks are much slower, and CPU load drops to ~15%:
14/12/09 18:31:18 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 33373 ms on fujitsu11. (11/12)
14/12/09 18:32:38 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 104181 ms on fujitsu11. (12/12)
How can I improve the performance of this code?
My cluster is one master machine and one slave machine. Both machines have 8 cores and 16 GB of RAM.
You have 8 executor cores (the ones on the slave). The RDD probably has 20 partitions (from the wholeTextFiles call). When the job starts, 20 tasks are created and the executor picks up 8 of them. Each time a task finishes, the executor picks up a new one. Eventually fewer than 8 tasks remain, and some executor threads begin to sit idle. That is why you see CPU usage gradually decrease until the job completes.
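The tail-off can be illustrated with simple arithmetic, assuming (as a simplification) roughly equal task durations. This is a hypothetical stand-alone sketch, not Spark code, using the numbers from the question: 20 partitions on one 8-core worker.

```java
// Sketch: with T equally sized tasks on C executor cores, the tasks run in
// ceil(T / C) "waves". The last wave holds T % C tasks (or C if T divides
// evenly), leaving the remaining cores idle until the job finishes.
public class WaveMath {
    static int lastWaveTasks(int tasks, int cores) {
        int r = tasks % cores;
        return r == 0 ? cores : r;
    }

    public static void main(String[] args) {
        int cores = 8;   // cores on the single worker
        int tasks = 20;  // wholeTextFiles("D:/tmp11", 20) -> ~20 partitions
        int waves = (tasks + cores - 1) / cores;
        int tail = lastWaveTasks(tasks, cores);
        System.out.println("waves=" + waves
                + " lastWaveTasks=" + tail
                + " idleCoresAtTail=" + (cores - tail));
        // -> waves=3 lastWaveTasks=4 idleCoresAtTail=4
    }
}
```

With 20 tasks on 8 cores, the last wave keeps only 4 of 8 cores busy, which matches the observed CPU drop. A common mitigation is to request more, smaller partitions (e.g. 3-4x the core count) so a slow task at the tail delays only a small slice of the work.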
Note that you are effectively using a single machine (the master does no work) with a distributed computing system. This is fine for development, or when you don't need performance. But if you want to improve performance, either use multiple machines or don't use Spark.