Markup for custom RDDs
JDBCRDD
potentially split for efficient parallelization of queries in the database.
Is there a way to migrate how the data is partitioned as a useful hint for the next step, perhaps groupBy
without reworking the data?
Example: I am loading date / region / value. From JDBCRDD
I am loading data split into Date. If I want to decrease / groupBy the date and region, I have to again not call sorting and shuffling for the date, and also use the fact that the RDD is already split by date.
In a pseudo API, I would do the following:
RDD rdd = new JDCBCRDD ...
Partitioner partitioning = (Row r)->p(r)
rdd.assertPartitioning(partitioning);
RDD<Pair<Key,Row>> rdd2 = rdd.groupWithinPartition((r)->f(r),Rowoperator::sum);
So now, in theory, all my groups should be executable JVM instances local, same node, same JVM, same thread.
source to share