Processing a Million Records as a Batch in BizTalk
I am looking for suggestions on how to handle this, and on whether I am using the right tool for the job. I work mainly with BizTalk, and we are currently on BizTalk 2013 R2 with SQL Server 2014.
Problem:
We will receive positional flat files every day (about 50) from different partners, and the theoretical total will be over a million records. Each record contains some identifying information that needs to be sent to a web service, which essentially returns YES or NO; based on that answer, the incoming file is split into two files.
Originally, the expected daily volume was 10,000 records, which later jumped to 100,000 and is now a million.
Attempt 1: Scatter-Gather pattern
I debatch the file in a custom pipeline with the flat file disassembler, adding a couple of custom port properties for the scatter part (following Richard Seroter's suggestion for implementing round robin) so that I can control the number of scatter/worker orchestrations. The workers I spin up call the web service and tag each record to be sent to "Agency A" or "Agency B". Finally, I publish a control message, which starts the gather/aggregator orchestration; it collects all the messages the workers have published to the MessageBox, via correlation, and creates two files to be routed to Agency A and Agency B.
Thus, each file that is dropped gets its own set of workers and its own aggregator to process it.
This works well for files with fewer records, but if the file has over 100k records, I see throttling happening and the file takes a long time to process and generate two files.
I have put the receive/worker and aggregator/send port on separate hosts. It looks like the aggregator orchestration gets dehydrated and doesn't actually aggregate the records processed by the workers until all of them are processed, and I think that because the ratio of published to processed messages is very high, it throttles.
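For readers unfamiliar with the pattern, here is a minimal sketch of round-robin scatter followed by a gather step, in plain Python rather than BizTalk. The `lookup` function is a hypothetical stand-in for the web service, and mapping YES to Agency A is an assumption made only for illustration.

```python
from collections import defaultdict


def scatter(records, worker_count):
    """Assign each record to a worker in round-robin order."""
    buckets = defaultdict(list)
    for index, record in enumerate(records):
        buckets[index % worker_count].append(record)
    return buckets


def lookup(record):
    """Hypothetical stand-in for the web service: even ids answer YES."""
    return record["id"] % 2 == 0


def gather(buckets):
    """Collect every worker's results into the two outbound batches."""
    agency_a, agency_b = [], []
    for worker_id in sorted(buckets):
        for record in buckets[worker_id]:
            # Assumption: YES goes to Agency A, NO goes to Agency B.
            (agency_a if lookup(record) else agency_b).append(record)
    return agency_a, agency_b


records = [{"id": n} for n in range(10)]
buckets = scatter(records, worker_count=3)
agency_a, agency_b = gather(buckets)
```

The gather step cannot finish until every worker has reported, which is the same waiting behavior described above for the aggregator.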
Attempt 2: Aggregation in SQL
Assuming the aggregator orchestration was the bottleneck, instead of accumulating the messages in an orchestration, I pushed the processed records to a SQL database and "split" the records into two XML files there (basically concatenating the messages for Agency A / Agency B, wrapping them in an XML declaration, and using the correct message type, based on some context properties written to the SQL table along with each record). These aggregated XML files are then processed and routed to the correct agencies.
This seems to work fine with 100k records and completes within a reasonable amount of time. Now that the requirement has changed again in terms of expected volume, I am trying to find out whether BizTalk is still a viable option.
I pointed out that BizTalk is not the right tool for a task like this, but the client suggests that we add more servers to get it working. I am also looking at SSIS.
Meanwhile, I ran some tests. Some observations:
- Increasing the number of workers improved processing (duh): it looks like when each worker handles fewer queue/subscription entries, it finishes its share quickly. In testing, a file of 100,000 records completed in under 3 hours using 100 workers. This was with minimal activity on the server from other applications. I am trying to get the group that hosts the web service to give me a theoretical maximum number of concurrent connections they can handle. I intend to ask them whether they can handle 1,000 calls; if so, the existing solution may scale, going by my observations.
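As a rough back-of-the-envelope check on those numbers, the sketch below works out the implied throughput. It treats 3 hours as the measured time (the test actually finished in less) and assumes throughput scales linearly with the number of workers, which is only an assumption until the web service's limits are known.

```python
records_tested = 100_000
hours_taken = 3            # upper bound from the test run
workers = 100

# Overall and per-worker throughput implied by the test.
records_per_second = records_tested / (hours_taken * 3600)
per_worker_per_second = records_per_second / workers

# Time for the new target volume at the same overall rate.
target_records = 1_000_000
hours_at_same_rate = target_records / records_per_second / 3600

# Workers needed to keep the same 3-hour window, if scaling were linear.
workers_needed = workers * target_records / records_tested

print(round(records_per_second, 2))   # about 9.26 records per second
print(round(hours_at_same_rate, 1))   # about 30 hours with 100 workers
print(round(workers_needed))          # 1000 workers under linear scaling
```

Under these assumptions a million records would take roughly 30 hours with 100 workers, which is why the question of 1,000 concurrent calls matters.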
I adjusted a few host settings for the message count and the physical memory threshold so that the host doesn't throttle at this volume, but I'm still not sure about them. I haven't needed to touch these settings before and could use tips on which specific counters to monitor.
The post is a bit long, but I hope it gives an idea of what I have done so far. Any help or insight in solving this problem is appreciated. If you suggest alternatives, I am limited to .NET or Microsoft tools/frameworks, but I would love to hear other options as well.
I will try to answer or give more details if anything I have written is unclear.
First, 1 million records/messages is not a problem, but you can make it a problem by handling it badly.
Here is the pattern I would try first.
- Load the records into SQL Server using SSIS. This will be very fast.
- Process the records in a BizTalk app for ... well, whatever needs to be done: the service call, etc.
- Update the SQL record with the result.
- When this process is complete, query the "Yes" and "No" batches, each as one (large) message, then transform and send.
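The steps above can be sketched with an in-memory SQLite table standing in for SQL Server. This is only an illustration of the staging-table pattern, not an SSIS or BizTalk implementation; the table layout, the column names, and the `call_service` rule are all hypothetical.

```python
import sqlite3


def call_service(identifier):
    """Hypothetical stand-in for the web service call."""
    return "YES" if identifier.endswith("0") else "NO"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging ("
    "id INTEGER PRIMARY KEY, identifier TEXT, payload TEXT, result TEXT)"
)

# Step 1: bulk load the records (SSIS would do this in the real setup).
rows = [(f"ID{n:04d}", f"record-{n}") for n in range(20)]
conn.executemany("INSERT INTO staging (identifier, payload) VALUES (?, ?)", rows)

# Steps 2 and 3: process each pending record and store the answer.
pending = conn.execute(
    "SELECT id, identifier FROM staging WHERE result IS NULL"
).fetchall()
for row_id, identifier in pending:
    conn.execute(
        "UPDATE staging SET result = ? WHERE id = ?",
        (call_service(identifier), row_id),
    )
conn.commit()


# Step 4: pull each batch as one large result set.
def export(result):
    cursor = conn.execute(
        "SELECT payload FROM staging WHERE result = ? ORDER BY id", (result,)
    )
    return [payload for (payload,) in cursor]


yes_batch = export("YES")
no_batch = export("NO")
```

Because the result lives on the row, a failed run can resume from the records whose `result` is still NULL instead of reprocessing the whole file.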
My guess is that the web service will become the bottleneck unless it is specifically designed for that kind of load. You will probably only need to tune BizTalk throttling if it turns out to be necessary, so don't worry about that for now. A good app pattern is more important.
In such scenarios, you should consider the following approach:
- Debatch the file and store the individual records in MSMQ. You can achieve this easily without any additional coding effort; all you need to do is create a send port using the MSMQ adapter, or the WCF-Custom adapter with the netMsmqBinding. If necessary, you can also create separate queues based on different criteria present in your messages.
- Receive messages from MSMQ using the receive location on a separate host.
- Send them to the web service from another BizTalk host.
- Try to keep this a messaging-only scenario; you can handle the service response using a pipeline component if needed. You can use a map on the send port. In the worst case, if you do need an orchestration, it only has to handle a single message, without any complex pattern.
- You can post messages again to two MSMQs for two different agencies based on the web service response.
- You can then receive these messages again and write them to a file. You can simply use a FILE send port with the Append option, or you can use your own pipeline component to write the received messages to a file without combining them in an orchestration. You can combine them in an orchestration if there are no more than a few thousand messages per file.
- With this approach, you won't have a bottleneck in BizTalk, and you don't have to use a complex orchestration pattern, which usually has many persistence points.
- If the web service becomes a bottleneck, you can control the rate at which messages are received from MSMQ using: 1) ordered delivery on the MSMQ receive location and, if required, 2) BizTalk throttling, by changing two kinds of host settings: lowering "Message count in DB" to a very low number such as 1,000 from the default 50K, and raising the spool and tracking data multipliers, e.g. to 500 from the default of 10, so that the product of the two numbers is still large enough to avoid throttling on the BizTalk side. You can also reduce the number of worker threads on the BizTalk host to slow it down a bit.
- Please note that MSMQ is part of the Windows OS and does not require any additional setup beyond that. It is usually installed by default; if it is not, you can add it through the Windows add/remove features dialog. You can also use IBM MQ if your organization has the infrastructure. But for a million messages, MSMQ will be fine.
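To make the shape of this approach concrete, here is a small sketch using Python's in-process `queue.Queue` in place of MSMQ: an inbound queue, a capped number of workers calling the service, and one outbound queue per agency. The `call_service` rule, the worker count, and the YES-to-Agency-A mapping are assumptions for illustration only.

```python
import queue
import threading


def call_service(record):
    """Hypothetical stand-in for the web service; True means YES."""
    return record % 3 == 0


inbound = queue.Queue()    # plays the role of the first MSMQ queue
agency_a = queue.Queue()   # outbound queue for YES records (assumed mapping)
agency_b = queue.Queue()   # outbound queue for NO records


def worker():
    """Handle one message at a time until a None sentinel arrives."""
    while True:
        record = inbound.get()
        if record is None:
            break
        (agency_a if call_service(record) else agency_b).put(record)


# Capping the worker count limits concurrent calls to the service.
MAX_CONCURRENT_CALLS = 4
threads = [threading.Thread(target=worker) for _ in range(MAX_CONCURRENT_CALLS)]
for thread in threads:
    thread.start()

for record in range(30):
    inbound.put(record)
for _ in threads:
    inbound.put(None)      # one sentinel per worker
for thread in threads:
    thread.join()


def drain(q):
    """Empty a queue into a sorted list (the 'write to file' step)."""
    items = []
    while not q.empty():
        items.append(q.get())
    return sorted(items)


file_a = drain(agency_a)
file_b = drain(agency_b)
```

Each worker only ever sees a single message, so there is no long-running aggregator waiting on the whole file.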
Apologies for the late update. We decided to use SSIS to bulk import the file into a table. Since the lookup web service belongs to the same organization and network, although it runs on a different stack, its owners agreed to let us query the lookup table on which their web service is based. We use a join between these tables to identify "Y" or "N" and export the results, also via SSIS.
In short, we did not use BizTalk for this. It now takes a couple of minutes to process files totaling 1.5 million records and send the split files.
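A minimal sketch of that set-based idea, again with in-memory SQLite standing in for the real database and SSIS packages. The table names, column names, and the rule that a match in the lookup table means "Y" are hypothetical; the real lookup logic is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE incoming (identifier TEXT, payload TEXT);
    CREATE TABLE lookup (identifier TEXT PRIMARY KEY);
    """
)

# Bulk import of the file (SSIS in the real setup).
conn.executemany(
    "INSERT INTO incoming VALUES (?, ?)",
    [("A1", "rec-1"), ("B2", "rec-2"), ("C3", "rec-3"), ("D4", "rec-4")],
)
# The service owner's lookup table (assumed: presence means Y).
conn.executemany("INSERT INTO lookup VALUES (?)", [("A1",), ("C3",)])

# One join flags every record, with no per-record service call.
rows = conn.execute(
    """
    SELECT i.payload,
           CASE WHEN l.identifier IS NULL THEN 'N' ELSE 'Y' END AS flag
    FROM incoming AS i
    LEFT JOIN lookup AS l ON l.identifier = i.identifier
    ORDER BY i.payload
    """
).fetchall()

yes_file = [payload for payload, flag in rows if flag == "Y"]
no_file = [payload for payload, flag in rows if flag == "N"]
```

Replacing a million individual service calls with a single join is what brings the run time down from hours to minutes.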
I appreciate all the recommendations provided here.