Kryo serialization fails with "Buffer overflow" when building a BloomFilter of billions of bits in Spark
I'm using Breeze's implementation of the Bloom filter in Apache Spark. My Bloom filter expects 200,000,000 keys, but I ran into the exception below:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 161, SVDG0752.ideaconnect.com): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1
I know that I can avoid this by increasing the spark.kryoserializer.buffer.max value, but due to cluster resource limitations I cannot raise it beyond 2 GB.
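For reference, the Kryo settings involved look roughly like this (the values here are illustrative, not my actual ones; as far as I know Spark caps spark.kryoserializer.buffer.max below 2048m anyway):

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64m")        // initial buffer, one per core on each executor
  .set("spark.kryoserializer.buffer.max", "1024m")  // illustrative; must stay under 2048m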
Below is the code:
import breeze.util.BloomFilter

val numOfBits = 2147483647   // Int.MaxValue buckets (~2 billion bits)
val numOfHashFun = 13
val bf = hierachyMatching.treeAggregate(new BloomFilter[String](numOfBits, numOfHashFun))(
  _ += _,   // fold each record into a per-partition filter
  _ |= _)   // union the partial filters
where hierachyMatching is an RDD[String] containing 200M records.
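For completeness, here is the same aggregation written out with treeAggregate's optional depth parameter (a real parameter of RDD.treeAggregate, default 2) so the call shape is explicit; the depth value below is just an assumed example, and the fully merged filter still has to reach the driver:

val bfDeep = hierachyMatching.treeAggregate(
  new BloomFilter[String](numOfBits, numOfHashFun))(
  (filter, key) => filter += key,   // fold each record into a per-partition filter
  (a, b) => a |= b,                 // union partial filters pairwise on executors
  depth = 4)                        // assumed value; the default is 2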
My questions:
- How can I solve this exception without increasing buffer.max?
- Is it possible to build a Bloom filter containing over 2 billion bits in Spark with 6512 MB of driver memory, and if so, how? (Rough sizing sketch below.)
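For context, a back-of-the-envelope sizing of the bit array behind the second question (assuming roughly one bit per bucket, e.g. a java.util.BitSet backing):

val bits  = 2147483647L            // Int.MaxValue buckets
val bytes = bits / 8               // ≈ 268 million bytes
val mb    = bytes / (1024 * 1024)  // ≈ 256 MB for the raw bit array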
Any ideas or suggestions related to this would be much appreciated. Thanks in advance.