Does the SSE FP unit detect 0.0 operands?
As described in my previous question, my idea was to optimize the algorithm by skipping computations whenever a coefficient m_a, m_b is 1.0 or 0.0. When I tried this optimization, I got some curious results that I cannot explain.
The first analyzer runs over 100k samples. The parameter values are read from a file (!):
b0 = 1.0 b1 = -1.480838022915731 b2 = 1.0
a0 = 1.0 a1 = -1.784147570544337 a2 = 0.854309980957510
The second analyzer runs over the same 100k samples. The parameter values are read from a file (!):
b0 = 1.0 b1 = -1.480838022915731 b2 = 1.0
a0 = 1.0 a1 = -1.784147570544337 a2 = 0.0 <--- Only a2 is different!
In the screenshots, the numbers on the left (gray background) are the processor cycles consumed. As you can see, the second run with a2 = 0.0 is much faster.
I checked the difference between debug and release builds. The release build is faster (as expected), but both debug and release show the same strange behavior when the a2 parameter is changed.
Then I checked the ASM code and noticed that SSE instructions are used. That is correct, because I compiled with /arch:SSE2. When I disabled SSE, the generated code no longer used SSE instructions and the performance was no longer affected by the value of a2 (as expected).
So I concluded that the performance advantage only exists when SSE is used, and that the SSE unit apparently detects that a2 is 0.0 and therefore skips the now-pointless multiplication and subtraction. I have never heard of such a feature and tried to find information about it, but without success.
Does anyone have an explanation for these results?
For completeness, this is the corresponding ASM code for the release version:
00F43EC0 mov edx,dword ptr [ebx]
00F43EC2 movss xmm0,dword ptr [eax+edi*4]
00F43EC7 cmp edx,dword ptr [ebx+4]
00F43ECA je $LN419+193h (0F43F9Dh)
00F43ED0 mov esi,dword ptr [ebx+4]
00F43ED3 lea eax,[edx+68h]
00F43ED6 lea ecx,[eax-68h]
00F43ED9 cvtps2pd xmm0,xmm0
00F43EDC cmp ecx,esi
00F43EDE je $LN419+180h (0F43F8Ah)
00F43EE4 movss xmm1,dword ptr [eax+4]
00F43EE9 mov ecx,dword ptr [eax]
00F43EEB mov edx,dword ptr [eax-24h]
00F43EEE movss xmm3,dword ptr [edx+4]
00F43EF3 cvtps2pd xmm1,xmm1
00F43EF6 mulsd xmm1,xmm0
00F43EFA movss xmm0,dword ptr [ecx]
00F43EFE cvtps2pd xmm4,xmm0
00F43F01 cvtps2pd xmm3,xmm3
00F43F04 mulsd xmm3,xmm4
00F43F08 xorps xmm2,xmm2
00F43F0B cvtpd2ps xmm2,xmm1
00F43F0F movss xmm1,dword ptr [ecx+4]
00F43F14 cvtps2pd xmm4,xmm1
00F43F17 cvtps2pd xmm2,xmm2
00F43F1A subsd xmm2,xmm3
00F43F1E movss xmm3,dword ptr [edx+8]
00F43F23 mov edx,dword ptr [eax-48h]
00F43F26 cvtps2pd xmm3,xmm3
00F43F29 mulsd xmm3,xmm4
00F43F2D subsd xmm2,xmm3
00F43F31 movss xmm3,dword ptr [edx+4]
00F43F36 cvtps2pd xmm4,xmm0
00F43F39 cvtps2pd xmm3,xmm3
00F43F3C mulsd xmm3,xmm4
00F43F40 movss xmm4,dword ptr [edx]
00F43F44 cvtps2pd xmm4,xmm4
00F43F47 cvtpd2ps xmm2,xmm2
00F43F4B xorps xmm5,xmm5
00F43F4E cvtss2sd xmm5,xmm2
00F43F52 mulsd xmm4,xmm5
00F43F56 addsd xmm3,xmm4
00F43F5A movss xmm4,dword ptr [edx+8]
00F43F5F cvtps2pd xmm1,xmm1
00F43F62 movss dword ptr [ecx+4],xmm0
00F43F67 mov edx,dword ptr [eax]
00F43F69 cvtps2pd xmm4,xmm4
00F43F6C mulsd xmm4,xmm1
00F43F70 addsd xmm3,xmm4
00F43F74 xorps xmm1,xmm1
00F43F77 cvtpd2ps xmm1,xmm3
00F43F7B movss dword ptr [edx],xmm2
00F43F7F movaps xmm0,xmm1
00F43F82 add eax,70h
00F43F85 jmp $LN419+0CCh (0F43ED6h)
00F43F8A movss xmm1,dword ptr [ebx+10h]
00F43F8F cvtps2pd xmm1,xmm1
00F43F92 mulsd xmm1,xmm0
00F43F96 xorps xmm0,xmm0
00F43F99 cvtpd2ps xmm0,xmm1
00F43F9D mov eax,dword ptr [ebp-4Ch]
00F43FA0 movss dword ptr [eax+edi*4],xmm0
00F43FA5 mov ecx,dword ptr [ebp-38h]
00F43FA8 mov eax,dword ptr [ebp-3Ch]
00F43FAB sub ecx,eax
00F43FAD inc edi
00F43FAE sar ecx,2
00F43FB1 cmp edi,ecx
00F43FB3 jb $LN419+0B6h (0F43EC0h)
Edit: Replaced the debug ASM code with the release code.
There are no early-outs for FP multiplication in SSE. It is a fully pipelined operation with short latency, so adding early-outs would complicate retirement while providing zero benefit. The only operations that typically have data-dependent performance on modern processors are divide and square root (ignoring subnormals, which affect a wider set of instructions). This is widely documented by both Intel and AMD, as well as independently by Agner Fog.
So why are you seeing a change in performance? The most likely explanation is that you are hitting stalls caused by subnormal inputs or results; this is very common with DSP filters and delay lines like the one you have. Without seeing your code and input it is impossible to be sure that this is what is happening, but it is by far the most likely explanation. If so, you can fix the problem by setting the DAZ and FTZ bits in MXCSR.
Intel documentation: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (consult the latency tables in the appendix; note that there is a fixed value for mulss and mulsd).
AMD family 16h instruction latencies (Excel table): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/AMD64_16h_InstrLatency_1.1.xlsx
Agner Fog's instruction latency tables for Intel and AMD: http://www.agner.org/optimize/instruction_tables.pdf
This would be the expected behavior if the FP hardware multiply unit implemented early-out optimizations; you can see an example here. It would mean that when the hardware detects a 0.0 operand, it does not send it through the entire pipeline.
However, you are using the SSE mulsd instruction. In his answer, Stephen Canon pointed out that the Intel and AMD implementations have a fixed latency for mulsd. This indicates that there is no early-out functionality in SSE.
Stephen Canon also pointed out that performance problems arise when denormal numbers are involved. In this post you can read more about what causes them.
However, all of your coefficients are ordinary values that do not appear to be denormal, so the problem could lie somewhere else. All the ASM instructions in your code are documented with fixed latencies, yet the huge difference between the loop timings indicates that something else is going on.
Your profiler output shows that all of the instruction timings changed, even though the 0.0 coefficient appears in only a few of them. Is the result still calculated correctly? Are all other variables constant between the runs?