Markup for custom RDDs
JDBCRDD
potentially split for efficient parallelization of queries in the database.
Is there a way to migrate how the data is partitioned as a useful hint for the next step, perhaps groupBy
without reworking the data?
Example: I am loading date / region / value. From JDBCRDD
I am loading data split into Date. If I want to decrease / groupBy the date and region, I have to again not call sorting and shuffling for the date, and also use the fact that the RDD is already split by date.
In a pseudo API, I would do the following:
RDD rdd = new JDCBCRDD ...
Partitioner partitioning = (Row r)->p(r)
rdd.assertPartitioning(partitioning);
RDD<Pair<Key,Row>> rdd2 = rdd.groupWithinPartition((r)->f(r),Rowoperator::sum);
So now, in theory, all my groups should be executable JVM instances local, same node, same JVM, same thread.
If you mean you need to store section index information with each element, I think mapWith is what you need. You can group the index of the data section into a new class and move on to the next step.
The split is controlled by the hash value of the elements in the RDD. To avoid shuffling to the next step, you basically need to ensure that the same hash value is generated. You do this by overriding the method hashCode
.