Kryo serialization fails with "Buffer overflow" when building a BloomFilter of billions of bits in Spark
I'm using Breeze's implementation of the Bloom filter in Apache Spark. My Bloom filter expects 200,000,000 keys, but I ran into the exception below:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 161, SVDG0752.ideaconnect.com): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1
I know that I can avoid this by increasing the spark.kryoserializer.buffer.max value, but due to cluster resource limitations I cannot raise it beyond 2 GB.
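For reference, the Kryo settings involved look roughly like this (the values here are illustrative, not my actual ones; as far as I know Spark caps spark.kryoserializer.buffer.max below 2048m anyway):

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64m")        // initial buffer, one per core on each executor
  .set("spark.kryoserializer.buffer.max", "1024m")  // illustrative; must stay under 2048m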
Below is the code:
import breeze.util.BloomFilter

val numOfBits = 2147483647   // Int.MaxValue buckets (~2 billion bits)
val numOfHashFun = 13
val bf = hierachyMatching.treeAggregate(new BloomFilter[String](numOfBits, numOfHashFun))(
  _ += _,   // fold each record into a per-partition filter
  _ |= _)   // union the partial filters
where hierachyMatching is an RDD[String] containing 200M records.
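For completeness, here is the same aggregation written out with treeAggregate's optional depth parameter (a real parameter of RDD.treeAggregate, default 2) so the call shape is explicit; the depth value below is just an assumed example, and the fully merged filter still has to reach the driver:

val bfDeep = hierachyMatching.treeAggregate(
  new BloomFilter[String](numOfBits, numOfHashFun))(
  (filter, key) => filter += key,   // fold each record into a per-partition filter
  (a, b) => a |= b,                 // union partial filters pairwise on executors
  depth = 4)                        // assumed value; the default is 2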
My questions:
- How can I solve this exception without increasing buffer.max?
- Is it possible to build a Bloom filter containing over 2 billion bits in Spark with 6512 MB of driver memory, and if so, how? (Rough sizing sketch below.)
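For context, a back-of-the-envelope sizing of the bit array behind the second question (assuming roughly one bit per bucket, e.g. a java.util.BitSet backing):

val bits  = 2147483647L            // Int.MaxValue buckets
val bytes = bits / 8               // ≈ 268 million bytes
val mb    = bytes / (1024 * 1024)  // ≈ 256 MB for the raw bit array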
Any ideas or suggestions related to this would be much appreciated. Thanks in advance.