Improving performance of batch loads into Azure Table Storage
I am trying to load about 25 million rows from an Azure SQL table into three different tables in Azure Table Storage (ATS). I currently manage to process about 50-100 rows per second, which means that at current speeds it will take about 70-140 hours to finish the load. That's a long time, and it seems like it should be possible to speed it up.
Here's what I'm doing:
- Kick off 10 separate tasks.
- For each task, read the next 10,000 unprocessed records from the SQL DB.
- For each of the three target ATS tables, group the 10,000 records by that table's partition key.
- In parallel (up to 10 at a time), for each partition key, split the partition into segments of at most 100 rows.
- In parallel (up to 10 at a time), for each segment, create a new TableBatchOperation.
- For each row in the segment, call batch.InsertOrReplace() (because some of the data has already been loaded and I don't know which rows).
- Execute the batch asynchronously.
- Rinse and repeat (with a lot of flow control, error checking, etc.).
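The grouping and segmenting steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the question: `Row` and `BatchPlanner` are hypothetical names, and the real entities would be table entities rather than a plain record. The limit of 100 reflects the ATS rule that a batch may contain at most 100 entities, all sharing one partition key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type, used only for this sketch.
public record Row(string PartitionKey, string RowKey);

public static class BatchPlanner
{
    // An ATS batch accepts at most 100 entities with the same partition key.
    public const int MaxBatchSize = 100;

    // Groups rows by partition key, then cuts each group into
    // segments of at most maxBatchSize rows.
    public static List<List<Row>> PlanBatches(
        IEnumerable<Row> rows, int maxBatchSize = MaxBatchSize)
    {
        return rows
            .GroupBy(r => r.PartitionKey)
            .SelectMany(g => g
                .Select((row, index) => (row, index))
                // Integer division assigns rows 0-99 to segment 0, 100-199 to segment 1, ...
                .GroupBy(x => x.index / maxBatchSize, x => x.row)
                .Select(segment => segment.ToList()))
            .ToList();
    }
}
```

For example, 230 rows in one partition followed by 20 rows in another would come back as four segments of 100, 100, 30 and 20 rows, each of which would then be turned into one TableBatchOperation.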
Some notes:
- I've tried this in a couple of different ways, with many different values for the numbers above, and I still can't get it below 10-20 ms per event.
- It doesn't seem to be CPU bound, as the VM doing the load averages around 10-20% CPU.
- It doesn't seem to be SQL bound, since the SQL select statement is the fastest part of the operation by at least two orders of magnitude.
- It's presumably not network bound, as the VM running the load is in the same datacenter (US West).
- I am getting reasonable partition density, i.e. each 10K record set is split into several hundred partitions for each table.
- With ideal partition density, I would have up to 3000 tasks running simultaneously (10 master tasks * 3 tables * 10 partitions * 10 segments). But they run asynchronously, and they are almost all I/O bound (on ATS), so I don't think we are hitting any threading limits on the VM running the process.
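Each "up to 10 at a time" limit described above could be expressed with a semaphore around asynchronous work. The sketch below is an assumption about how such a limit might look, not the code from the question; `Throttle.ForEachAsync` is a hypothetical helper. Nesting it across master tasks, tables, partitions and segments is what multiplies out to the 3000 figure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Runs action for every item, with at most maxConcurrency
    // invocations in flight at any moment.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> action)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = items.Select(async item =>
        {
            await gate.WaitAsync();      // wait for a free slot
            try { await action(item); }
            finally { gate.Release(); }  // free the slot even on failure
        }).ToList();
        await Task.WhenAll(tasks);
    }
}
```

Because the work is I/O bound, the waiting tasks hold no threads while they are queued on the semaphore, which fits the observation that CPU usage stays low.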
The only other obvious idea I can come up with is one I tried earlier, namely doing an ORDER BY on the partition key in the SQL select statements so that we get perfect partition density for the batch inserts. For various reasons that has proved difficult, since the indexes on the table aren't quite set up for it. And while I would expect some speedup on the ATS side from this approach, given that I am already grouping the 10K records by their partition keys, I would not expect it to improve performance significantly.
Any other suggestions for speeding this up? Or is this about as fast as it gets?
Still open to other suggestions, but I found this page quite helpful:
http://blogs.msmvps.com/nunogodinho/2013/11/20/windows-azure-storage-performance-best-practices/
In particular, these settings:
ServicePointManager.Expect100Continue = false;
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.DefaultConnectionLimit = 100;
With these settings, I managed to drop the average processing time from ~10-20 ms per event to ~2 ms per event. Much better.
But as I said, I'm still open to other suggestions. I've read about other people getting over 20,000 operations per second on ATS, and I'm still stuck at around 500.
What do your partition keys look like? If they are incremental numbers, Azure will place them on the same storage node. So you should use clearly distinct partition keys such as "A1", "B2", etc. instead of "1", "2", etc. That way your partitions will be handled by different storage nodes and throughput will scale across them.
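One possible way to get keys like "A1" and "B2" from numeric keys is to prepend a letter derived from a stable hash of the key. The answer does not say how the prefix should be chosen, so this is an assumption for illustration only; `PartitionKeys.WithPrefix` is a hypothetical helper, and the hash is a simple FNV-1a-style function over the key's characters.

```csharp
using System;

public static class PartitionKeys
{
    // Prepends a letter A-Z chosen by a stable hash of the key, so that
    // consecutive numeric keys no longer sort next to each other.
    // string.GetHashCode() is avoided because it is not stable across processes.
    public static string WithPrefix(string key)
    {
        unchecked
        {
            uint hash = 2166136261u;       // FNV offset basis
            foreach (char c in key)
            {
                hash ^= c;
                hash *= 16777619u;         // FNV prime
            }
            char prefix = (char)('A' + (int)(hash % 26));
            return prefix + key;
        }
    }
}
```

Since the prefix is computed from the key itself, a reader that knows the original key can recompute the full partition key without a lookup table.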