How can I reduce the size of an Apache Spark task?

I am trying to run the following code in Scala on the Spark framework, but I am getting an extremely large task size (8 MB):

// Types of the values involved:
tidRDD: RDD[ItemSet]
mh: MineHelper
x: ItemSet
broadcast_tid: Broadcast[Array[ItemSet]]
count: Int

tidRDD.flatMap(x => mh.mineFreqSets(x, broadcast_tid.value, count)).collect()
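
(For context, a broadcast variable like broadcast_tid is typically created on the driver along these lines; the sc and tidArray names below are illustrative, not from the original code.)

// Typical setup for broadcasting the collected item sets (illustrative names):
// each executor receives one read-only copy instead of a copy per task.
val tidArray: Array[ItemSet] = tidRDD.collect()
val broadcast_tid: Broadcast[Array[ItemSet]] = sc.broadcast(tidArray)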

The reason I added the MineHelper class was to make the mining method serializable; it contains only that one method. ItemSet is a class with 3 private members and several getter/setter methods, nothing fancy. I believe this is the correct way to approach the problem, but Spark thinks differently. Am I making some gaping mistake, or is it something small that's wrong?

Here's a warning:

WARN TaskSetManager: Stage 1 contains a task of very large size (8301 KB). The maximum recommended task size is 100 KB.

1 answer


You are probably closing over `this`, forcing the entire enclosing object to be serialized with the task.

You probably have something like the following:

class Foo {
  val outer = ???              // member of the enclosing class
  def f(rdd: RDD[ItemSet]): RDD[ItemSet] = {
    rdd.map(x => outer.g(x))   // `outer` here is really `this.outer`
  }
}

In this case, when Spark serializes the task, it needs an instance of the enclosing `Foo`. Indeed, when you refer to `outer`, you really mean `this.outer`, so the whole `Foo` instance gets dragged into the closure.

A simple fix is to copy the external members into local variables:

class Foo {
  val outer = ??? 
  def f(rdd: RDD[ItemSet]): RDD[ItemSet] = {
    val _outer = outer         // local variable
    rdd.map(x => _outer.g(x))  // no reference to `this`
  }
}
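
Applied to the snippet from the question, the same trick looks roughly like the sketch below. The Miner wrapper class and the bodies of ItemSet and MineHelper are hypothetical stand-ins invented for illustration; only the copy-members-into-locals pattern is the point.

import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast

// Hypothetical stand-ins for the classes described in the question.
class ItemSet(private val items: Array[Int]) extends Serializable

class MineHelper extends Serializable {
  def mineFreqSets(x: ItemSet, all: Array[ItemSet], count: Int): Seq[ItemSet] =
    Seq(x) // placeholder body
}

class Miner(mh: MineHelper, count: Int) {
  def run(tidRDD: RDD[ItemSet],
          broadcast_tid: Broadcast[Array[ItemSet]]): Array[ItemSet] = {
    // Copy the members into locals so the lambda captures only these two
    // small values instead of `this` (the whole Miner instance).
    val localMh = mh
    val localCount = count
    tidRDD
      .flatMap(x => localMh.mineFreqSets(x, broadcast_tid.value, localCount))
      .collect()
  }
}

Note that referencing broadcast_tid inside the closure is fine: only the small Broadcast handle travels with the task, and broadcast_tid.value is resolved on the executors.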
