Setting gridSize in Spring Batch partitioning

In Spring Batch partitioning, the relationship between the gridSize given to the PartitionHandler and the number of ExecutionContext instances returned by the Partitioner is a bit confusing. For example, the MultiResourcePartitioner documents that it ignores gridSize, but the Partitioner documentation doesn't explain when/why it is okay to do this.

For example, say I have a TaskExecutor that I want to reuse across different parallel steps, and I have set its pool size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5 and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will parallelism behave?

Say the MultiResourcePartitioner returns 10 partitions for a particular run. Does that mean only 5 of them will run concurrently at a time until all 10 are complete, and that no more than 5 of the 20 threads will be used for this step?

If so, when/why is it okay to ignore the gridSize parameter when overriding Partitioner with a custom implementation? I think it would help if this were described in the documentation.

If not, how can I achieve this? That is, how can I reuse the task executor yet independently control how many partitions can run in parallel for this step, and how many partitions are actually created?



1 answer


There are some good questions here, so let's go through them one at a time:

For example, say I have a TaskExecutor that I want to reuse across different parallel steps, and I have set its pool size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5 and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will parallelism behave?

The TaskExecutorPartitionHandler defers the concurrency limits to the TaskExecutor you provide. Because of that, in your example the PartitionHandler will use up to 20 threads, since that is what the TaskExecutor allows.
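To make that concrete, here is a minimal sketch of the setup described in the question (bean names such as sharedTaskExecutor and workerStep are invented for the example). Note that the gridSize of 5 is only handed to the Partitioner; the actual concurrency comes from the 20-thread executor:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class PartitionConfig {

    // Shared 20-thread pool reused across parallel steps, as in the question.
    @Bean
    public TaskExecutor sharedTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(20);
        return executor;
    }

    // gridSize = 5 is only a hint passed to the Partitioner; the actual number of
    // partitions running at once is whatever the TaskExecutor above allows (up to 20).
    @Bean
    public PartitionHandler partitionHandler(TaskExecutor sharedTaskExecutor, Step workerStep) {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setTaskExecutor(sharedTaskExecutor);
        handler.setStep(workerStep);
        handler.setGridSize(5);
        return handler;
    }
}
```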

If so, when/why is it okay to ignore the gridSize parameter when overriding Partitioner with a custom implementation? I think it would help if this were described in the documentation.



When we look at a partitioned step, there are two components involved: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data to be divided up and how best to do so. The PartitionHandler is responsible for delegating that work to the slaves for execution. In order for the PartitionHandler to do its delegation, it needs to understand the "fabric" it is working with (local threads, remote slave processes, etc.).

When dividing up the data to be processed (via the Partitioner), it can be useful to know how many workers are available. However, that metric is not always relevant; it depends on the data you are working with. For example, when dividing database rows, it makes sense to divide them evenly across the number of available workers. However, in most scenarios it is impractical to combine or split files, so it is easier to just create a partition per file. Whether gridSize is useful depends on the data you are trying to partition. A concrete illustration follows below.
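Here is a hedged sketch of the first scenario: a Partitioner that does honor gridSize by splitting a known id range evenly. The class name, key names, and the idea of passing min/max ids to the workers are assumptions made for the example, not anything prescribed by Spring Batch:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Illustrative only: splits a known id range evenly across gridSize partitions.
// Here gridSize is genuinely useful because rows can be divided arbitrarily.
public class IdRangePartitioner implements Partitioner {

    private final long minId;  // assumed to be known, e.g. queried up front
    private final long maxId;

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long rangeSize = (maxId - minId + 1 + gridSize - 1) / gridSize; // ceiling division
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", minId + i * rangeSize);
            context.putLong("maxId", Math.min(maxId, minId + (i + 1) * rangeSize - 1));
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}
```

By contrast, MultiResourcePartitioner creates one ExecutionContext per resource and never looks at gridSize, because a file usually cannot be subdivided in a meaningful way.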

If not, how can I achieve this? That is, how can I reuse the task executor yet independently control how many partitions can run in parallel for this step, and how many partitions are actually created?

If you are reusing a TaskExecutor, you may not be able to, since that TaskExecutor may be doing other things. I would question why you are reusing one, given the relatively low overhead of creating a dedicated one (you can even make it step scoped so that it is only created while the partitioned step is running).
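As a rough sketch of that suggestion (the bean name and the pool size of 5 are assumptions for the example), a dedicated, step-scoped executor sized to the desired parallelism lets the Partitioner still create one partition per file while only that many partitions run at once:

```java
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class DedicatedExecutorConfig {

    // Dedicated executor used only by the partitioned step's PartitionHandler.
    // The Partitioner can still create one partition per file; with a 5-thread pool,
    // only 5 partition tasks execute at once and the rest wait in the executor's queue.
    // @StepScope means the pool only exists while the partitioned step is running.
    @Bean
    @StepScope
    public ThreadPoolTaskExecutor partitionedStepExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(5);
        executor.setMaxPoolSize(5);
        return executor;
    }
}
```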
