Co-location and co-partitioning of RDDs

Question


I am very new to Spark and I have 2 questions:

I have a large set of points. I built an RDD from them (called partitionedData) and partitioned it with a custom partitioner so that each partition holds no more than a threshold number of points. Since I need to select some points as leaders in each partition, and make sure that the leaders and the points of each partition are on the same node, I call mapPartitions on partitionedData with the preservesPartitioning flag set to true. The resulting RDD is my desired leader RDD. Here is my first question: I know the leader RDD preserves its parent RDD's partitioning (co-partitioned), but I'm not sure whether the leaders in each partition end up on the same node as their parent partition. Are they co-located?
If the answer to the above question is no, how can I co-locate the partitions of this RDD with those of another pre-partitioned RDD?
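
For reference, here is a minimal Scala sketch of the setup described above. It is an illustration under assumptions, not the actual code from the question: BucketPartitioner, the Int bucket key, the Point alias and the "first point of each partition is the leader" rule are all hypothetical stand-ins for the real custom partitioner and leader selection.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical partitioner: the key is assumed to be a precomputed bucket id.
class BucketPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case bucket: Int => ((bucket % numPartitions) + numPartitions) % numPartitions
    case _           => 0
  }
  // equals/hashCode let Spark treat two RDDs that use this partitioner
  // with the same partition count as co-partitioned.
  override def equals(other: Any): Boolean = other match {
    case p: BucketPartitioner => p.numPartitions == numPartitions
    case _                    => false
  }
  override def hashCode: Int = numPartitions
}

object LeaderSketch {
  type Point = (Double, Double)

  // Placeholder leader rule (an assumption): keep the first point of each partition.
  // preservesPartitioning = true carries the parent's partitioner over to the result.
  def leaders(partitionedData: RDD[(Int, Point)]): RDD[(Int, Point)] =
    partitionedData.mapPartitions(iter => iter.take(1), preservesPartitioning = true)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("leader-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val points = sc.parallelize(Seq(
        0 -> (0.0, 0.0), 0 -> (0.1, 0.2),
        1 -> (5.0, 5.0), 1 -> (5.5, 4.5)))
      val partitionedData = points.partitionBy(new BucketPartitioner(2))
      val leader = leaders(partitionedData)

      // Co-partitioning check: both RDDs report the same partitioner.
      println(leader.partitioner == partitionedData.partitioner)
      // The lineage shows leader as a narrow child of partitionedData.
      println(leader.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Because mapPartitions is a narrow transformation, partition i of leader is computed by the same task that reads partition i of partitionedData, so within a single job the leaders are produced where their parent partition is processed. Comparing the partitioner fields only confirms co-partitioning; it says nothing about where the data physically lives across separate jobs.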


apache-spark rdd

Farzad Nozarian, Apr 22 '15 at 9:50


1 answer

Javier Bañez · Answer 1 · 2018-02-19T15:38:54+0000

For the partitions to be co-located, so that you can be sure no shuffle takes place, all the co-located partitions must be processed within the same action.

If you have intermediate actions, the Integer index produced by the custom partitioner could be assigned to different nodes, in which case a shuffle is required.
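
One way to read this advice is to derive the leaders and consume them together with the points inside a single job. The sketch below does that with zipPartitions. It is only an illustration: HashPartitioner stands in for the custom partitioner, and the Point alias, the Int bucket key, the withLeader helper and the first-point leader rule are assumptions that do not come from the question.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SingleJobSketch {
  type Point = (Double, Double)

  // Pairs every point with the leader of its own partition.
  // Nothing runs until an action is called on the returned RDD.
  def withLeader(partitionedData: RDD[(Int, Point)]): RDD[(Point, Point)] = {
    // Placeholder leader rule (an assumption): first point of each partition.
    val leader = partitionedData.mapPartitions(_.take(1), preservesPartitioning = true)

    // zipPartitions hands partition i of both RDDs to the same task.
    // It requires both RDDs to have the same number of partitions.
    partitionedData.zipPartitions(leader) { (pts, ldrs) =>
      ldrs.toList.headOption match {
        case Some((_, l)) => pts.map { case (_, p) => (p, l) }
        case None         => Iterator.empty // empty partition: no leader, no points
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("single-job-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val partitionedData = sc.parallelize(Seq(
        0 -> (0.0, 0.0), 0 -> (0.1, 0.2),
        1 -> (5.0, 5.0), 1 -> (5.5, 4.5)))
        .partitionBy(new HashPartitioner(2))

      // A single action triggers the whole pipeline as one job.
      withLeader(partitionedData).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The only shuffle here is the one partitionBy needs to build partitionedData. The leader step and the zip are narrow dependencies on it, so they run in the same stage. If leader were instead materialised by a separate action and combined with another RDD in a later job, Spark would be free to schedule the matching partitions on different executors, which is the situation the answer warns about.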
