Setting gridSize in Spring Batch partitioning
In Spring Batch partitioning, the relationship between the gridSize passed to the PartitionHandler and the number of ExecutionContexts returned by the Partitioner is a bit confusing. For example, the MultiResourcePartitioner documentation states that it ignores gridSize, but the Partitioner documentation doesn't explain when/why it's okay to do this.
For example, say I have a TaskExecutor that I want to reuse across different parallel steps, and that I have set its pool size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5 and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will parallelism behave?
Say the MultiResourcePartitioner returns 10 partitions on a particular run. Does that mean only 5 of them will run concurrently until all 10 are complete, and that no more than 5 of the 20 threads will be used for this step?
If so, when/why is it okay to ignore the gridSize parameter when overriding Partitioner with a custom implementation? I think it would help if the documentation described this.
If not, how can I achieve this? That is, how can I reuse the task executor, yet separately control the number of partitions that can run in parallel for this step and the number of partitions actually created?
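For concreteness, here is a minimal sketch of the wiring I'm describing, using Spring Batch's Java config (bean names and the input path are illustrative, not from a real project):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class PartitionConfig {

    @Bean
    public ThreadPoolTaskExecutor sharedTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);  // the shared pool of 20 threads
        executor.setMaxPoolSize(20);
        return executor;
    }

    @Bean
    public MultiResourcePartitioner filePartitioner(
            @Value("file:/data/in/*.csv") Resource[] inputFiles) {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(inputFiles);  // one partition per file
        return partitioner;
    }

    @Bean
    public TaskExecutorPartitionHandler partitionHandler(Step workerStep,
            ThreadPoolTaskExecutor sharedTaskExecutor) {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setStep(workerStep);
        handler.setTaskExecutor(sharedTaskExecutor);
        handler.setGridSize(5);  // does this cap concurrency at 5?
        return handler;
    }
}
```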
There are some good questions here, so let's go through them individually:
For example, say I have a TaskExecutor that I want to reuse across different parallel steps, and that I have set its pool size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5 and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will parallelism behave?
The TaskExecutorPartitionHandler defers concurrency limits to the TaskExecutor you provide. Because of this, the PartitionHandler in your example will use up to 20 threads, since that is what the TaskExecutor allows.
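To see the consequence concretely, here is a plain-JDK sketch (no Spring required) of what the partition handler effectively does: submit one worker per partition to the pool and let the pool size, not gridSize, bound concurrency. With a 20-thread pool and 10 partitions, all 10 start at once:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionConcurrencyDemo {
    public static void main(String[] args) throws Exception {
        // A pool sized like the shared TaskExecutor in the question (20 threads).
        ExecutorService pool = Executors.newFixedThreadPool(20);

        int partitions = 10;                       // partitions returned by the Partitioner
        CountDownLatch allStarted = new CountDownLatch(partitions);
        CountDownLatch gate = new CountDownLatch(1);

        // Submit one "worker step" per partition, as the partition handler does.
        for (int i = 0; i < partitions; i++) {
            pool.submit(() -> {
                allStarted.countDown();            // this partition is now running
                try { gate.await(); } catch (InterruptedException ignored) { }
            });
        }

        // All 10 partitions start concurrently: the pool, not gridSize, is the limit.
        boolean concurrent = allStarted.await(5, TimeUnit.SECONDS);
        System.out.println("all partitions running concurrently: " + concurrent);

        gate.countDown();                          // release the workers
        pool.shutdown();
    }
}
```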
If so, when/why is it okay to ignore the gridSize parameter when overriding Partitioner with a custom implementation? I think it would help if the documentation described this.
When we look at a partitioned step, there are two components to the problem: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data and how best to divide it. The PartitionHandler is responsible for delegating that work to the workers for execution. In order for the PartitionHandler to do its delegation, it needs to understand the fabric it is working with (local threads, remote worker processes, etc.).
When dividing up the data to be processed (via the Partitioner), it can be helpful to know how many workers are available. However, that metric isn't always useful, depending on the data you are working with. For example, when partitioning database rows, it makes sense to divide them evenly across the number of available workers. But in most scenarios it is impractical to combine or split files, so it is easiest to create one partition per file. Whether gridSize is useful therefore depends on the data you are trying to partition.
If not, how can I achieve this? That is, how can I reuse the task executor, yet separately control the number of partitions that can run in parallel for this step and the number of partitions actually created?
If you reuse the TaskExecutor, you may not be able to, since that TaskExecutor may be busy doing other things. I'd question why you are reusing one at all, given the relatively low overhead of creating a dedicated one (you can even make it step scoped so that it is only created while the partitioned step is running).