How can I reduce the size of an Apache Spark task?
I am trying to run the following code in Scala on the Spark framework, but I am getting an extremely large task size (8 MB):
tidRDD: RDD[ItemSet]
mh: MinerHelper
x: ItemSet
broadcast_tid: Broadcast[Array[ItemSet]]
count: Int

tidRDD.flatMap(x => mh.mineFreqSets(x, broadcast_tid.value, count)).collect()
The reason I added the class MinerHelper was to make the mining code serializable; it contains only the method shown above. ItemSet is a class with 3 private members and several getter/setter methods, nothing fancy. I believe this is the correct way to approach the problem, but Spark thinks differently. Am I making some glaring mistake, or is it something small that's wrong?
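For concreteness, the shapes being described are roughly the following (a sketch; the member names and the return type of mineFreqSets are assumptions, since only the call site appears in the code above):

// Sketch of the described classes; member names are assumptions.
class MinerHelper extends Serializable {
  // Mines frequent sets for one transaction against the broadcast data.
  def mineFreqSets(x: ItemSet, all: Array[ItemSet], count: Int): Seq[ItemSet] = ???
}

class ItemSet(private var items: Array[Int],
              private var support: Int,
              private var id: Long) extends Serializable {
  def getItems: Array[Int] = items
  def setItems(i: Array[Int]): Unit = { items = i }
  // ... analogous getters/setters for the other two members
}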
Here's the warning:
WARN TaskSetManager: Stage 1 contains a task of very large size (8301 KB). The maximum recommended task size is 100 KB.
You are probably closing over this, which forces the entire enclosing object to be serialized with the task. You likely have something like the following:
class Foo {
  val outer = ???  // a field on the enclosing instance
  def f(rdd: RDD[ItemSet]): RDD[ItemSet] = {
    rdd.map(x => outer.g(x))  // the closure refers to this.outer
  }
}
In this case, when Spark serializes the task, it needs an instance of the enclosing Foo as well. Indeed, when you refer to outer, you really mean this.outer.
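In other words, the lambda the compiler generates behaves roughly like the following (a sketch of the desugaring, not literal compiler output):

rdd.map(x => this.outer.g(x))  // `this` (the whole Foo) travels with the closure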
A simple fix is to put the external variables into local ones:
class Foo {
  val outer = ???
  def f(rdd: RDD[ItemSet]): RDD[ItemSet] = {
    val _outer = outer  // local variable
    rdd.map(x => _outer.g(x))  // no reference to `this`
  }
}
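Applied to the code in the question, the same trick means copying mh, broadcast_tid and count into local vals before building the closure, so the lambda no longer drags the enclosing instance (and whatever large fields it holds) into the task. A sketch, assuming those are fields of an enclosing class:

val localMh = mh              // local copies: the closure captures
val localTid = broadcast_tid  // only these vals, not `this`
val localCount = count
tidRDD.flatMap(x => localMh.mineFreqSets(x, localTid.value, localCount)).collect()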