Markup for custom RDDs

JDBCRDD

potentially split for efficient parallelization of queries in the database.

Is there a way to migrate how the data is partitioned as a useful hint for the next step, perhaps groupBy

without reworking the data?

Example: I am loading date / region / value. From JDBCRDD

I am loading data split into Date. If I want to decrease / groupBy the date and region, I have to again not call sorting and shuffling for the date, and also use the fact that the RDD is already split by date.

In a pseudo API, I would do the following:

RDD rdd = new JDCBCRDD ...
Partitioner partitioning = (Row r)->p(r)
rdd.assertPartitioning(partitioning);
RDD<Pair<Key,Row>> rdd2 = rdd.groupWithinPartition((r)->f(r),Rowoperator::sum);

      

So now, in theory, all my groups should be executable JVM instances local, same node, same JVM, same thread.

+3


source to share


2 answers


If you mean you need to store section index information with each element, I think mapWith is what you need. You can group the index of the data section into a new class and move on to the next step.



+1


source


The split is controlled by the hash value of the elements in the RDD. To avoid shuffling to the next step, you basically need to ensure that the same hash value is generated. You do this by overriding the method hashCode

.



0


source







All Articles