U-SQL performance

Can you help me with my job? I started it with 10 AUs, and at first almost all of them are used. But from the second half of the runtime, it only uses 1 AU. I see one super vertex in the plan with only one vertex, which looks like an underestimated execution plan (this is just a guess).

I am trying to analyze the runtime myself, but it is difficult without technical descriptions of operations like HashCombine, HashCross, and so on.

So my question is: is there anything I can do about it (change the code, add hints, etc.)?

my job ID link


The issue was resolved with Michael Rys's solution.

I followed Michael Rys's solution and it works great. Thanks as always! See the figure below: almost all 10 of the 10 AUs are now in use. I also played with the modeling tool, and the script appears to scale almost linearly. Awesome :).

[Figure: AU usage after the fix, showing nearly all 10 AUs in use]

Another solution

I can also replace the inner joins with left outer joins (the replacement is equivalent in my case, because there is ALWAYS exactly one row in a dimension table for any record in the fact table, i.e. a dim-1:M-fact relationship). The CBO can then estimate the cardinality of the result as at least no less than the fact table, in which case the CBO generates a good plan with no hints.
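For illustration, a minimal U-SQL sketch of that rewrite, under loud assumptions: the fact table and all column names are hypothetical, and only dim_application is a table name taken from this job.

    // Assumption: every fact row matches exactly one dimension row, so
    // LEFT OUTER JOIN returns the same rows as INNER JOIN while letting
    // the CBO estimate the output cardinality as at least the fact
    // table's row count.
    @result =
        SELECT f.fact_key,
               f.measure,
               a.application_name
        FROM fact_table AS f
             LEFT OUTER JOIN dim_application AS a
             ON f.application_key == a.application_key;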


1 Answer


I'll pass your job link to one of our developers who can take a look, and I will update this answer as soon as I receive more information.

However, for Stack Overflow reference, it would be helpful to see the script and/or the job graph. For example: how much data are you processing? Are you using an operation that implies ordering or grouping, etc.?

Based on the vertex execution view, it appears that you are pulling from a large number of small files, each containing only a small amount of data. It is possible that the optimizer assumes that only a small amount of data is coming from these files.

Assuming my initial guess is correct, you can add an OPTION(ROWCOUNT=xxx) hint to the EXTRACT statement to indicate a larger row count, where xxx is a number large enough to force the system to parallelize.
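A minimal sketch of what that hint looks like, assuming the input is a file set of many small TSV files (the path and the two columns are hypothetical):

    // Without the hint, the optimizer may estimate only a few rows from
    // the many small input files; ROWCOUNT overrides that estimate.
    @input =
        EXTRACT user     string,
                duration int
        FROM "/input/{*}.tsv"
        USING Extractors.Tsv()
        OPTION(ROWCOUNT = 1000000);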

Additional information after looking at the job



The plan is a 13-way join with 12 dimension tables and 1 fact table. The misestimate (an underestimate that causes a serial plan) starts after 9 of the 12 joins are completed. The last 3 joins (with dim_application, dim_operation and dim_data_type) are done serially. The spine of the plan still carries 29 GB at that point. It is very difficult to estimate the cardinality across 9 joins, since we have no foreign-key information.

The workaround that will most likely help:

  • Split the join statement in two, performing the joins with dim_application, dim_operation and dim_data_type in the second statement.
  • Add a ROWCOUNT hint with a large number to the output of the first join statement (see the sketch below).
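A sketch of that split, under loud assumptions: the fact table, the first nine dimension tables, and all column names are hypothetical; only dim_application, dim_operation and dim_data_type are taken from this job.

    // Statement 1: join the fact table with the first 9 dimensions.
    // The ROWCOUNT hint keeps the cardinality estimate high so the
    // remaining joins are planned in parallel instead of serially.
    @partial =
        SELECT f.fact_key,
               f.measure,
               f.application_key,
               f.operation_key,
               f.data_type_key,
               d1.dim1_name // ... plus columns from dim_2 through dim_9
        FROM fact_table AS f
             INNER JOIN dim_1 AS d1 ON f.dim1_key == d1.dim1_key
             // ... joins with dim_2 through dim_9 go here ...
        OPTION(ROWCOUNT = 100000000);

    // Statement 2: the three joins that previously ran serially.
    @result =
        SELECT p.fact_key,
               p.measure,
               a.application_name,
               o.operation_name,
               t.data_type_name
        FROM @partial AS p
             INNER JOIN dim_application AS a ON p.application_key == a.application_key
             INNER JOIN dim_operation AS o ON p.operation_key == o.operation_key
             INNER JOIN dim_data_type AS t ON p.data_type_key == t.data_type_key;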

Let me know if it helps.
