Does the SSE FP unit detect 0.0 operands?

Following up on my previous question, my idea was to optimize the algorithm by skipping computations when a coefficient m_a or m_b is 1.0 or 0.0. While trying this optimization I got some curious results that I cannot explain.
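For context, the two coefficient sets below suggest a standard Direct Form I biquad. This is a minimal sketch of what such an inner loop computes; the struct and member names are my own illustration, not the asker's actual code:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical Direct Form I biquad, as the coefficient sets below
// suggest (struct and member names are illustrative, not the real code).
struct Biquad {
    double b0, b1, b2, a1, a2;              // a0 assumed normalized to 1.0
    double x1 = 0, x2 = 0, y1 = 0, y2 = 0;  // delay-line state

    double process(double x) {
        // Every multiply executes unconditionally: the instruction
        // stream is identical whether a2 is 0.854... or 0.0, so any
        // speed difference must come from the data, not the code path.
        double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;
        y2 = y1; y1 = y;
        return y;
    }
};
```

The planned optimization would skip the terms whose coefficient is 0.0 and fold away multiplications by 1.0.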

The first analyzer runs over 100k samples. The parameter values are read from a file (!):

b0 = 1.0 b1 = -1.480838022915731 b2 = 1.0

a0 = 1.0 a1 = -1.784147570544337 a2 = 0.854309980957510

Slow

The second analyzer runs over the same 100k samples. The parameter values are read from a file (!):

b0 = 1.0 b1 = -1.480838022915731 b2 = 1.0

a0 = 1.0 a1 = -1.784147570544337 a2 = 0.0 <--- Only a2 is different!

Fast

In the screenshots, the numbers on the left (gray background) are the processor cycles required. As you can see, the second run with a2 = 0.0 is much faster.

I checked the difference between debug and release builds. The release code is faster (as expected). Both the debug and the release build show the same strange behavior when the a2 parameter is changed.

Then I checked the ASM code and noticed that SSE instructions are used. That is expected, because I compiled with /arch:SSE2. So I disabled SSE. The resulting code no longer uses SSE instructions, and its performance is no longer affected by the value of the a2 parameter (as expected).

So I concluded that the performance advantage only appears when SSE is used, and that the SSE unit detects that a2 is 0.0 and therefore skips the pointless multiplication and subtraction. I have never heard of such behavior and tried to find information about it, but to no avail.

Does anyone have an explanation for these results?

For completeness, here is the corresponding ASM code of the release build:

00F43EC0  mov         edx,dword ptr [ebx]  
00F43EC2  movss       xmm0,dword ptr [eax+edi*4]  
00F43EC7  cmp         edx,dword ptr [ebx+4]  
00F43ECA  je          $LN419+193h (0F43F9Dh)  
00F43ED0  mov         esi,dword ptr [ebx+4]  
00F43ED3  lea         eax,[edx+68h]  
00F43ED6  lea         ecx,[eax-68h]  
00F43ED9  cvtps2pd    xmm0,xmm0  
00F43EDC  cmp         ecx,esi  
00F43EDE  je          $LN419+180h (0F43F8Ah)  
00F43EE4  movss       xmm1,dword ptr [eax+4]  
00F43EE9  mov         ecx,dword ptr [eax]  
00F43EEB  mov         edx,dword ptr [eax-24h]  
00F43EEE  movss       xmm3,dword ptr [edx+4]  
00F43EF3  cvtps2pd    xmm1,xmm1  
00F43EF6  mulsd       xmm1,xmm0  
00F43EFA  movss       xmm0,dword ptr [ecx]  
00F43EFE  cvtps2pd    xmm4,xmm0  
00F43F01  cvtps2pd    xmm3,xmm3  
00F43F04  mulsd       xmm3,xmm4  
00F43F08  xorps       xmm2,xmm2  
00F43F0B  cvtpd2ps    xmm2,xmm1  
00F43F0F  movss       xmm1,dword ptr [ecx+4]  
00F43F14  cvtps2pd    xmm4,xmm1  
00F43F17  cvtps2pd    xmm2,xmm2  
00F43F1A  subsd       xmm2,xmm3  
00F43F1E  movss       xmm3,dword ptr [edx+8]  
00F43F23  mov         edx,dword ptr [eax-48h]  
00F43F26  cvtps2pd    xmm3,xmm3  
00F43F29  mulsd       xmm3,xmm4  
00F43F2D  subsd       xmm2,xmm3  
00F43F31  movss       xmm3,dword ptr [edx+4]  
00F43F36  cvtps2pd    xmm4,xmm0  
00F43F39  cvtps2pd    xmm3,xmm3  
00F43F3C  mulsd       xmm3,xmm4  
00F43F40  movss       xmm4,dword ptr [edx]  
00F43F44  cvtps2pd    xmm4,xmm4  
00F43F47  cvtpd2ps    xmm2,xmm2  
00F43F4B  xorps       xmm5,xmm5  
00F43F4E  cvtss2sd    xmm5,xmm2  
00F43F52  mulsd       xmm4,xmm5  
00F43F56  addsd       xmm3,xmm4  
00F43F5A  movss       xmm4,dword ptr [edx+8]  
00F43F5F  cvtps2pd    xmm1,xmm1  
00F43F62  movss       dword ptr [ecx+4],xmm0  
00F43F67  mov         edx,dword ptr [eax]  
00F43F69  cvtps2pd    xmm4,xmm4  
00F43F6C  mulsd       xmm4,xmm1  
00F43F70  addsd       xmm3,xmm4  
00F43F74  xorps       xmm1,xmm1  
00F43F77  cvtpd2ps    xmm1,xmm3  
00F43F7B  movss       dword ptr [edx],xmm2  
00F43F7F  movaps      xmm0,xmm1  
00F43F82  add         eax,70h  
00F43F85  jmp         $LN419+0CCh (0F43ED6h)  
00F43F8A  movss       xmm1,dword ptr [ebx+10h]  
00F43F8F  cvtps2pd    xmm1,xmm1  
00F43F92  mulsd       xmm1,xmm0  
00F43F96  xorps       xmm0,xmm0  
00F43F99  cvtpd2ps    xmm0,xmm1  
00F43F9D  mov         eax,dword ptr [ebp-4Ch]  
00F43FA0  movss       dword ptr [eax+edi*4],xmm0  
00F43FA5  mov         ecx,dword ptr [ebp-38h]  
00F43FA8  mov         eax,dword ptr [ebp-3Ch]  
00F43FAB  sub         ecx,eax  
00F43FAD  inc         edi  
00F43FAE  sar         ecx,2  
00F43FB1  cmp         edi,ecx  
00F43FB3  jb          $LN419+0B6h (0F43EC0h)  


Edit: Replaced the debug ASM code with the release build's code.





2 answers


There are no early-outs for FP multiplication in SSE. It is a fully pipelined operation with short latency, so adding early-outs would complicate retirement while providing zero benefit. The only instructions that typically have data-dependent performance on modern processors are divide and square root (ignoring subnormals, which affect a wider set of instructions). This is extensively documented by both Intel and AMD, and independently by Agner Fog.

So why are you seeing a performance change? The most likely explanation is that you are hitting stalls caused by subnormal inputs or results; this is very common with DSP filters and delay lines like the one you have. Without seeing your code and input it is impossible to be sure that this is what is happening, but it is by far the most likely explanation. If so, you can fix the problem by setting the DAZ and FTZ bits in MXCSR.
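Setting those bits is a one-liner with the SSE intrinsics headers; a minimal sketch (the wrapper function name is my own):

```cpp
#include <cassert>
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (FTZ)
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (DAZ)

// Set the FTZ and DAZ bits in MXCSR for the calling thread.
// FTZ flushes subnormal *results* to 0.0; DAZ treats subnormal
// *inputs* as 0.0. Both avoid the microcode assists that make
// subnormal arithmetic slow, at the cost of strict IEEE 754
// behavior in the subnormal range.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

MXCSR is per-thread state, so call this once at the start of each processing thread.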

Intel documentation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (consult the latency tables in the appendix; note that mulss and mulsd have a fixed latency.)



AMD family 16h instruction latencies (Excel spreadsheet): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/AMD64_16h_InstrLatency_1.1.xlsx

Agner Fog's instruction latency tables for Intel and AMD: http://www.agner.org/optimize/instruction_tables.pdf





This would be normal behavior if the FP hardware multiply unit implemented early outs. You can read about that technique here. It means that when the hardware detects a 0.0 operand, it does not push the operation through the entire pipeline.

However, you are using the SSE mulsd instruction. In his answer, Stephen Canon pointed out that both the Intel and the AMD implementations have a fixed latency for mulsd. This rules out early outs in SSE.

Stephen Canon also pointed out that performance problems arise when denormal numbers come into play. In this post you can read more about what causes this.
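Even when none of the coefficients or input samples are denormal themselves, a filter's feedback path can decay into the subnormal range once the input goes quiet. A minimal single-precision sketch of that decay (the one-pole loop is my own illustration of the effect, not the asker's filter):

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Count how many iterations of a decaying feedback loop y *= decay
// produce subnormal (denormal) values before the state underflows
// to zero. With a coefficient like 0.854..., the state shrinks
// geometrically and spends a stretch of iterations below FLT_MIN;
// each multiply there can trigger a hardware assist. With a
// coefficient of 0.0 the feedback term vanishes immediately and no
// subnormals ever occur.
int count_subnormal_iterations(float decay) {
    float y = 1.0f;
    int subnormal = 0;
    for (int n = 0; n < 2000; ++n) {
        y *= decay;
        if (y != 0.0f && std::fabs(y) < FLT_MIN) ++subnormal;
    }
    return subnormal;
}
```

Whether this is what happens in the asker's runs can be confirmed by inspecting the state variables for subnormal values, or simply by toggling the FTZ/DAZ bits and re-measuring.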



However, all of your coefficients are ordinary values that do not appear to be denormals, so the problem could be somewhere else. All the ASM instructions in your code are documented with fixed latencies, yet the huge cycle difference indicates that something else is going on.

Your profiling output shows that all of your latencies changed, even though the 0.0 coefficient appears in only a few of the operations. Is the result computed correctly? Are all other variables constant between runs?









