How to make this SSIS script more parallel

I have a million rows in a database table. For each row, I have to run a custom exe, parse the output, and update another database table.

How do I process multiple rows in parallel?

Right now I have a simple data flow task: Get Data -> Run Script (run process, parse output) -> Store Data. It took 3 hours for 6000 rows, which is far too slow.

+1




2 answers


There is one bottleneck here: executing a process for each row. Increasing "EngineThreads" would not help at all, as only one thread runs that particular script transform. The time spent in the other transforms probably doesn't matter at all. Processes are heavy objects, and thousands of them will never be cheap.

I can think of the following ideas to make it better:

1) The best fix is to convert your custom EXE into an assembly and call it from within the script transform. That avoids the overhead of creating processes, parsing their output, etc.

2) If you have to use separate processes, you can try running them in parallel. This helps if each process spends most of its time waiting for I/O (i.e. it is I/O bound). If the processes are memory bound or CPU bound, you won't gain much by running them in parallel.

2A) Complex script, simple package.

To run them in parallel, change the ProcessInput method in your script to start the process asynchronously and not wait for it to finish: go on to the next row and create the next process. Subscribe to the process's exit event (Process.Exited in .NET) so you know when each one finishes and can handle its output. Limit the number of processes running in parallel, otherwise you will run out of memory. Wait for all processes to complete before returning from the ProcessInput call.
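To illustrate the shape of 2A, here is a rough sketch of the bounded-parallelism pattern. It is written in Python for brevity, not as SSIS script code; in a real Script Component the same idea would be expressed in C# or VB.NET. The names `run_one`, `run_batch` and `demo_command` are made up for this example, and `demo_command` is only a stand-in for the custom exe. One call to `run_batch` plays the role of one ProcessInput call.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(row_id, command):
    """Start one external process, wait for it, and return its trimmed output."""
    done = subprocess.run(command, capture_output=True, text=True, check=True)
    return row_id, done.stdout.strip()


def run_batch(rows, build_command, max_parallel=8):
    """Run one process per row, with at most max_parallel processes alive at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [
            pool.submit(run_one, row_id, build_command(value))
            for row_id, value in rows
        ]
        # Collecting the results blocks until every process has finished,
        # which mirrors "wait for all processes before returning".
        return dict(future.result() for future in futures)


def demo_command(value):
    # Hypothetical stand-in for the custom exe: prints its argument upper-cased.
    return [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", value]


if __name__ == "__main__":
    print(run_batch([(1, "a"), (2, "b"), (3, "c")], demo_command, max_parallel=2))
```

The worker pool size is the cap on concurrent processes. A good value depends on what the exe is waiting for, so it is worth measuring with a few different settings.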

2B) Simple script, complex package.

Keep the current sequential script, but split the data using SSIS. Add a Conditional Split transform and split the input into multiple streams based on some hash expression, something that gives each output roughly the same amount of data. The number of outputs should equal the number of process instances that you want to run in parallel. Add a script transform to each Conditional Split output. Now you must also increase the "EngineThreads" property :) and these transforms will run in parallel. (Note: based on the tag, I am assuming you are using SSIS 2008. You will need to insert additional Union All transformations for it to work in SSIS 2005.)
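The simplest "hash" for 2B is often a modulo on an integer key. Assuming the table has an integer column (called RowId here purely for illustration), the Conditional Split conditions for four outputs could look like `[RowId] % 4 == 0`, `[RowId] % 4 == 1`, and so on. The Python sketch below only demonstrates how such a split distributes rows; `stream_for` and `split_rows` are hypothetical helper names.

```python
def stream_for(row_id, stream_count):
    """Pick the output stream for a row from its integer key (modulo split)."""
    return row_id % stream_count


def split_rows(row_ids, stream_count):
    """Distribute row ids over stream_count lists, one list per output stream."""
    streams = [[] for _ in range(stream_count)]
    for row_id in row_ids:
        streams[stream_for(row_id, stream_count)].append(row_id)
    return streams


if __name__ == "__main__":
    streams = split_rows(range(10), 4)
    print([len(s) for s in streams])  # prints [3, 3, 2, 2]
```

With sequential keys the outputs differ by at most one row. If the keys have gaps or a pattern (for example, only even ids), the split can be lopsided and a different expression would be needed.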

This should make it work better, but a million processes is still a lot. You are unlikely to get really good performance here.
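To put numbers on that, here is a back-of-the-envelope estimate based on the figures in the question (6000 rows in 3 hours). It assumes the per-row cost stays constant and that parallel runs scale perfectly, which is optimistic; `estimate_hours` is just an illustrative helper.

```python
def estimate_hours(total_rows, sample_rows, sample_hours, parallel=1):
    """Extrapolate total run time from a measured sample, assuming ideal scaling."""
    seconds_per_row = sample_hours * 3600 / sample_rows
    return total_rows * seconds_per_row / parallel / 3600


if __name__ == "__main__":
    # 3 hours / 6000 rows is 1.8 seconds per row.
    print(estimate_hours(1_000_000, 6000, 3))              # about 500 hours sequentially
    print(estimate_hours(1_000_000, 6000, 3, parallel=8))  # about 62.5 hours with 8 in parallel
```

Even with eight processes running in parallel and perfect scaling, a million rows would still take days, which is why option 1 is the one worth pursuing first.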

+3




If you are running this in a Data Flow task, the task has an "EngineThreads" property, which defaults to 5 in SSIS 2005 (10 in SSIS 2008). You can set it to a higher number, for example 20, which lets the engine use more threads to process these rows.

This is just a performance tweak; if your SSIS package is still very slow, I would reconsider the architecture and design of the package.

+1

