How can I improve the runtime of the last tasks in a standalone cluster?
I have this code:
JavaRDD<Document> termDocsRdd = sc.wholeTextFiles("D:/tmp11", 20).flatMap(
    new FlatMapFunction<Tuple2<String, String>, Document>() {
        @Override
        public Iterable<Document> call(Tuple2<String, String> tup) {
            return Arrays.asList(DocParse.parse(parsingFunction(tup)));
        }
    }
);
Here I am reading text files from local storage (not a distributed filesystem) and normalizing them (each file is ~100 KB - 1.5 MB). parsingFunction contains no Spark operations such as map or flatMap, and no data-distribution functions.
When I run the application on a standalone cluster, all processors on the worker machine are initially fully loaded (100%), and I see in the console:
14/12/09 18:30:41 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, fujitsu11., PROCESS_LOCAL, 1377 bytes)
14/12/09 18:30:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 12873 ms on fujitsu11. (1/12)
14/12/09 18:30:42 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, fujitsu11., PROCESS_LOCAL, 1327 bytes)
14/12/09 18:30:42 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 14001 ms on fujitsu11. (2/12)
14/12/09 18:30:44 INFO scheduler.TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, fujitsu11., PROCESS_LOCAL, 1327 bytes)
14/12/09 18:30:44 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 15925 ms on fujitsu11. (3/12)
...
Later I see that the last tasks are much slower, and CPU load drops to ~15%:
14/12/09 18:31:18 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 33373 ms on fujitsu11. (11/12)
14/12/09 18:32:38 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 104181 ms on fujitsu11. (12/12)
How can I improve the performance of this code?
My cluster is one master machine and one slave machine. Both machines have 8 cores and 16 GB of RAM.
You have 8 executor cores (the ones on the slave). The RDD probably has 20 partitions (from the wholeTextFiles call). When the job starts, 20 tasks are created and the executor picks up 8 of them. Each time a task finishes, the executor picks up a new one. Eventually fewer than 8 tasks remain, and some executor threads begin to sit idle. That is why you see CPU usage gradually decrease until the job completes.
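The tail-off can be illustrated with simple arithmetic, assuming (as a simplification) roughly equal task durations. This is a hypothetical stand-alone sketch, not Spark code, using the numbers from the question: 20 partitions on one 8-core worker.

```java
// Sketch: with T equally sized tasks on C executor cores, the tasks run in
// ceil(T / C) "waves". The last wave holds T % C tasks (or C if T divides
// evenly), leaving the remaining cores idle until the job finishes.
public class WaveMath {
    static int lastWaveTasks(int tasks, int cores) {
        int r = tasks % cores;
        return r == 0 ? cores : r;
    }

    public static void main(String[] args) {
        int cores = 8;   // cores on the single worker
        int tasks = 20;  // wholeTextFiles("D:/tmp11", 20) -> ~20 partitions
        int waves = (tasks + cores - 1) / cores;
        int tail = lastWaveTasks(tasks, cores);
        System.out.println("waves=" + waves
                + " lastWaveTasks=" + tail
                + " idleCoresAtTail=" + (cores - tail));
        // -> waves=3 lastWaveTasks=4 idleCoresAtTail=4
    }
}
```

With 20 tasks on 8 cores, the last wave keeps only 4 of 8 cores busy, which matches the observed CPU drop. A common mitigation is to request more, smaller partitions (e.g. 3-4x the core count) so a slow task at the tail delays only a small slice of the work.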
Note that you are effectively using a single machine (the master does no work) with a distributed computing system. This is fine for development, or when you don't need performance. But if you want to improve performance, either use multiple machines or don't use Spark.