Why hardcode the repartition value?

Looking at the example Spark code, I can see that the number passed to repartition or coalesce is hardcoded:

val resDF = df.coalesce(16)

What is the best approach to managing this setting once the hardcoded value becomes irrelevant, for example when the cluster can scale up or down in a matter of seconds?



1 answer


Well, examples usually contain hardcoded values, so you don't need to worry; you can change them. The partitioning documentation is full of hardcoded values too, but those values are just examples.

The rule of thumb for the number of partitions:

Your RDD should have as many partitions as the number of executors times the number of cores per executor, multiplied by 3 (or possibly 4). Of course, this is a heuristic and it really depends on your application, dataset, and cluster configuration.
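
One way to avoid the hardcoded 16 is to derive the count from the running SparkContext at the time the job executes. Here is a minimal sketch, assuming the heuristic above and assuming defaultParallelism is a reasonable proxy for executors times cores (its exact meaning depends on the cluster manager); df is just a stand-in for your real DataFrame:

import org.apache.spark.sql.SparkSession

// In spark-shell the session already exists as `spark`; getOrCreate()
// reuses it rather than building a new one.
val spark = SparkSession.builder().getOrCreate()
val df = spark.range(1000000L).toDF("id")  // stand-in for your real DataFrame

// defaultParallelism typically reflects the total cores across the current
// executors (cluster-manager dependent), so the partition count follows the
// cluster instead of being hardcoded.
val heuristicFactor = 3  // the "times 3 (or possibly 4)" from the rule of thumb
val numPartitions = spark.sparkContext.defaultParallelism * heuristicFactor

val resDF = df.repartition(numPartitions)

Keep in mind the value is read once, at the moment that line runs; if the cluster rescales mid-job, the count is stale until it is recomputed.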



Note, however, that repartitioning does not come for free, so in a dynamically scaling environment you need to be sure that the overhead of repartitioning is negligible relative to the gain you get from the operation.

Coalesce and repartition can have different costs, as I mentioned in my answer.
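
A minimal sketch of that difference, continuing from the snippet above (the partition counts are arbitrary):

// coalesce to fewer partitions is a narrow transformation: existing
// partitions are merged without a full shuffle, so it is cheap.
val merged = resDF.coalesce(4)

// repartition always shuffles: more expensive, but it rebalances the data
// evenly and, unlike DataFrame.coalesce, can also increase the count.
val rebalanced = resDF.repartition(64)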
