Reducing with a color filter
I would like to get a quick approximate specified membership based on a String-valued function applied to a large Spark RDD String Vectors (~ 1B records). Basically the idea would be to collapse into Bloom filter . This color filter can then be handed over to workers for later use.
In particular, I am currently
rdd: RDD[Vector[String]]
f: Vector[String] => String
val uniqueVals = rdd.map(f).distinct().collect()
val uv = sc.broadcast(uniqueVals)
But uniqueVals
too big to be practical, and I would like to replace it with something smaller (and known) in size, i.e. a color filter.
My questions:
-
Is it possible to collapse into a Bloom filter, or do I need to first build and then build it in the driver?
-
is there a mature Scala / Java Bloom filter implementation that is suitable for this?
source to share
Yes, Bloom filters can be scaled down because they have good properties (they are monoids ). This means that you can do all the aggregations in parallel, effectively using only one pass over the data to build a BloomFilter for each section, and then shrink those BloomFilters together to get a single BloomFilter that you can query for contains
.
There are at least two BloomFilter implementations in Scala and both seem to be mature projects (haven't actually used them in production). The first is Breeze and the second is Twitter Algebird . Both contain implementations of different sketches and more.
This is an example of how to do it with Breeze:
import breeze.util.BloomFilter
val nums = List(1 to 20: _*).map(_.toString)
val rdd = sc.parallelize(nums, 5)
val bf = rdd.mapPartitions { iter =>
val bf = BloomFilter.optimallySized[String](10000, 0.001)
iter.foreach(i => bf += i)
Iterator(bf)
}.reduce(_ | _)
println(bf.contains("5")) // true
println(bf.contains("31")) // false
source to share