Spark in Scala: how to avoid a linear scan when looking up a key in each partition?

I have one huge key-value dataset named A, and a set of keys named B as queries. My task is: for every key in B, determine whether the key exists in A, and if it does, return its value.

First I partition A with a HashPartitioner(100). Currently I can either use A.join(B'), where B' = B.map(x => (x, null)), or call A.lookup() for each key in B.
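For concreteness, here is a minimal sketch of those two baseline approaches (rawA and B are placeholder names; I assume Long keys and Double values throughout):

    import org.apache.spark.HashPartitioner

    // Hash-partition the big dataset into 100 partitions.
    val A = rawA.partitionBy(new HashPartitioner(100))

    // Approach 1: turn the query keys into a pair RDD and join.
    val Bp = B.map(x => (x, null))
    val joined = A.join(Bp)            // RDD[(Long, (Double, Null))]

    // Approach 2: one driver-side lookup per query key; this launches
    // a separate Spark job for every key in B.
    val results = B.collect().map(k => (k, A.lookup(k)))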

The problem, however, is that both join and lookup on a PairRDD do a linear scan within each partition. That is too slow. What I want is for each partition to hold a HashMap, so that a key can be found within its partition in O(1). The ideal flow would be: the master receives a batch of keys, routes each key to its corresponding partition, the partition looks the keys up in its HashMap, and the results are returned to the master.

Is there an easy way to achieve this?

One possible way: when I searched the internet, I found a similar question here:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMwrk0kPiHoX6mAiwZTfkGRPxKURHhn9iqvFHfa4aGj3XJUCNg@mail.gmail.com%3E

As suggested there, I built a HashMap for each partition using the following code:

 import scala.collection.mutable.HashMap

 // Build one HashMap per partition, so a lookup inside a partition
 // is O(1). Each partition yields exactly one HashMap.
 val hashpair = A.mapPartitions(iterator => {
     val hashmap = new HashMap[Long, Double]
     iterator.foreach { case (key, value) => hashmap.getOrElseUpdate(key, value) }
     Iterator(hashmap)
 })


Now I have 100 HashMaps (assuming A has 100 partitions). Here is where I am lost: I don't know how to use hashpair to look up the keys in B, since hashpair is not a normal pair RDD. Do I need to implement a new RDD and implement the RDD methods on hashpair? If so, what is the easiest way to implement join or lookup for hashpair?
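One idea I had (not sure it is the right way) is to route the query keys with the same HashPartitioner and then use zipPartitions, so each query key meets exactly the one HashMap that could contain it. A minimal sketch, assuming B is an RDD[Long] and that hashpair kept A's 100 partitions:

    import org.apache.spark.HashPartitioner
    import scala.collection.mutable.HashMap

    // Send each query key to the partition holding its HashMap, using
    // the same HashPartitioner(100) that A was partitioned with.
    val queries = B.map(k => (k, null)).partitionBy(new HashPartitioner(100))

    // Zip partition i of hashpair with partition i of the routed queries;
    // each key is then found (or not) in O(1) via the partition's HashMap.
    val found = hashpair.zipPartitions(queries) { (maps, keys) =>
        val m = if (maps.hasNext) maps.next() else new HashMap[Long, Double]
        keys.map { case (k, _) => (k, m.get(k)) }   // (key, Option[value])
    }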

Thanks everyone.





1 answer


You are probably looking for IndexedRDD: https://github.com/amplab/spark-indexedrdd
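If I read the project's README correctly, basic usage looks roughly like this (the get/multiget signatures are from memory, so double-check against the repo):

    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

    // Build an IndexedRDD from the existing pair RDD. It hash-partitions
    // the entries and keeps an index inside each partition, so point
    // lookups do not scan the partition.
    val indexed = IndexedRDD(A).cache()

    // Single-key lookup.
    val one: Option[Double] = indexed.get(42L)

    // Batch lookup for a set of query keys.
    val many: Map[Long, Double] = indexed.multiget(B.collect())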











