Co-location and co-partitioning of RDDs

I am very new to Spark and I have 2 questions:

  • I have a large set of points, and I made an RDD (called partitionedData) out of them and partitioned it with a custom partitioner so that each partition has no more than a threshold number of points. Since I need to select some points as leaders in each partition and make sure that the leaders and the points of each partition are on the same node, I call mapPartitions and set the flag preservesPartitioning to true. The result is my desired RDD, leader. Here is my first question: I know the RDD leader preserves the partitioning of its parent RDD (they are co-partitioned), but I'm not sure whether the leaders in each partition will be placed on the same node as their parent points. Are they co-located?
  • If the answer to the above question is no, how can I co-locate the partitions of this RDD with those of another pre-partitioned RDD?
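
A minimal sketch of the pipeline described above, assuming PySpark. The helper names (assign_sections, pick_leaders, build_leaders), the sort-and-cut sectioning rule, and the "first point is the leader" rule are all hypothetical stand-ins for the custom partitioner and leader selection, which the question does not show. Only partitionBy and mapPartitions with preservesPartitioning=True come from the question itself.

```python
import math


def assign_sections(points, threshold):
    """Key each point by a section id so that no section holds more than
    `threshold` points (hypothetical rule: sort, then cut into equal runs)."""
    ordered = sorted(points)
    return [(i // threshold, p) for i, p in enumerate(ordered)]


def pick_leaders(partition):
    """Iterator-in, iterator-out function in the shape mapPartitions expects.
    Hypothetical rule: the first point seen in each section is its leader."""
    seen = set()
    for section_id, point in partition:
        if section_id not in seen:
            seen.add(section_id)
            yield (section_id, point)


def build_leaders(sc, points, threshold):
    """Needs a live SparkContext `sc`; both RDDs stay lazy until an action runs."""
    keyed = assign_sections(points, threshold)
    num_sections = max(1, math.ceil(len(points) / threshold))
    # The key already is the target partition index, so the partition
    # function is the identity.
    partitioned_data = sc.parallelize(keyed).partitionBy(num_sections, lambda k: k)
    # preservesPartitioning=True keeps the parent's partitioner on the result,
    # which is only valid because pick_leaders never changes a key.
    leader = partitioned_data.mapPartitions(pick_leaders, preservesPartitioning=True)
    return partitioned_data, leader
```

With a running SparkContext, comparing partitioned_data.partitioner with leader.partitioner is one way to check that the two RDDs are co-partitioned. It says nothing by itself about which node each partition ends up on.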


1 answer

For partitions to be co-located, so that you can be sure no shuffle takes place, all the co-partitioned RDDs must be computed within the same action.

If there are intermediate actions, the partitions identified by the integer index that the custom partitioner generates could be assigned to different nodes, in which case a shuffle is required.


