Cost of an if check versus the operation itself?
Here are two different ways I could implement a left shift by >= 64 bits using inline SSE intrinsics. The second variation special-cases (shift == 64) and avoids one SSE instruction, but adds the cost of the if check:
inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
   __m128i r ;

   r = _mm_slli_si128( a, 8 ) ;                                    // a << 64 (byte shift by 8)
   r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;  // << (shift - 64); a no-op when shift == 64

   return r ;
}
inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
   __m128i r ;

   r = _mm_slli_si128( a, 8 ) ; // a << 64 (byte shift by 8)

   if ( shift > 64 )
   {
      // only pay for the remaining bit shift when it actually does something
      r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   }

   return r ;
}
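As a quick sanity check, either variant can be exercised as below. This is a minimal sketch of my own (the main() harness and test value are not from the original post), and it assumes one of the definitions above is pasted into the same file; for shift == 64 the low bit should land in the high qword:
#include <emmintrin.h>
#include <cstdio>
#include <cstdint>

// ... one of the shiftLeftGte64ByBits() definitions above goes here ...

int main()
{
   __m128i a = _mm_set_epi64x( 0, 1 ) ;          // 128-bit value 1
   __m128i r = shiftLeftGte64ByBits( a, 64 ) ;   // expect 1 << 64

   uint64_t lo = (uint64_t)_mm_cvtsi128_si64( r ) ;
   uint64_t hi = (uint64_t)_mm_cvtsi128_si64( _mm_srli_si128( r, 8 ) ) ;

   // expect: hi = 0x0000000000000001, lo = 0x0000000000000000
   printf( "hi = 0x%016llx, lo = 0x%016llx\n",
           (unsigned long long)hi, (unsigned long long)lo ) ;

   return 0 ;
}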
I was wondering (roughly) how the cost of this if () check compares to the cost of the shift itself, say relative to the time or number of cycles required for an ordinary ALU left-shift instruction.
I micro-benchmarked this using code like:
void timingWithIf( volatile __m128i * pA, volatile unsigned long * pShift, unsigned long n )
{
   __m128i r = *pA ;

   for ( unsigned long i = 0 ; i < n ; i++ )
   {
      r = _mm_slli_si128( r, 8 ) ; // a << 64

      // volatile read keeps the shift count from being hoisted out of the loop
      unsigned long shift = *pShift ;

      // does it hurt more to do the check, or just do the operation?
      if ( shift > 64 )
      {
         r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
      }
   }

   *pA = r ;
}
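For comparison, the "performing the no-op shift" numbers below came from a loop without the branch. The original post doesn't show that variant, so this is my reconstruction of what it presumably looked like (the name timingNoIf is mine):
void timingNoIf( volatile __m128i * pA, volatile unsigned long * pShift, unsigned long n )
{
   __m128i r = *pA ;

   for ( unsigned long i = 0 ; i < n ; i++ )
   {
      r = _mm_slli_si128( r, 8 ) ; // a << 64

      unsigned long shift = *pShift ;

      // always do the remaining shift, even when (shift - 64) == 0
      r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   }

   *pA = r ;
}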
The following code is generated:
xor %eax,%eax
movdqa (%rdi),%xmm0
test %rdx,%rdx
movdqa %xmm0,0xffffffffffffffe8(%rsp)
jbe F0
pxor %xmm0,%xmm0
B0: movdqa 0xffffffffffffffe8(%rsp),%xmm2
pslldq $0x8,%xmm2
movdqa %xmm2,0xffffffffffffffe8(%rsp)
mov (%rsi),%rcx
cmp $0x40,%rcx
jbe F1
add $0xffffffffffffffc0,%rcx
movd %ecx,%xmm1
punpckldq %xmm0,%xmm1
punpcklqdq %xmm0,%xmm1
psllq %xmm1,%xmm2
movdqa %xmm2,0xffffffffffffffe8(%rsp)
F1: inc %rax
cmp %rdx,%rax
jb B0
F0: movdqa 0xffffffffffffffe8(%rsp),%xmm0
movdqa %xmm0,(%rdi)
retq
nopl 0x0(%rax)
Note that the shift the branch avoids actually takes three SSE instructions (four if you count the ALU -> XMM register move), plus one ALU add:
add $0xffffffffffffffc0,%rcx
movd %ecx,%xmm1
punpckldq %xmm0,%xmm1
punpcklqdq %xmm0,%xmm1
psllq %xmm1,%xmm2
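For context, this sequence is only needed because the shift count is a run-time value: _mm_set_epi32() has to build the count in an XMM register before the variable-count psllq can consume it. If the residual count were a compile-time constant, the immediate form would do it in a single instruction. The helper below is my own illustration (hypothetical name, requires <emmintrin.h> like the functions above), not part of the original measurement:
inline __m128i shiftLeftBy65( const __m128i & a )
{
   __m128i r = _mm_slli_si128( a, 8 ) ;  // a << 64 (byte shift)
   return _mm_slli_epi64( r, 1 ) ;       // << (65 - 64), psllq with an immediate count
}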
With 1 billion loop iterations, I measure:
1) shift == 64:
~2.5 s with the if (avoiding the no-op shift).
~2.8 s performing the no-op shift.
2) shift == 65:
~2.8 s with or without the if.
Timing was done on an "Intel(R) Xeon(R) CPU X5570 @ 2.93GHz" (per /proc/cpuinfo) and was fairly consistent.
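The original post doesn't show the timing harness; the numbers above could have been gathered with something along these lines (my own sketch using clock_gettime(), so the exact mechanism is an assumption; it assumes timingWithIf() from above is in scope):
#include <emmintrin.h>
#include <time.h>
#include <cstdio>

// timingWithIf() (and its branch-free counterpart) as defined above

int main()
{
   volatile __m128i a = _mm_set_epi64x( 0, 1 ) ;
   volatile unsigned long shift = 64 ;        // or 65 for the second case
   const unsigned long n = 1000000000UL ;     // 1 billion iterations

   struct timespec t0, t1 ;
   clock_gettime( CLOCK_MONOTONIC, &t0 ) ;

   timingWithIf( &a, &shift, n ) ;

   clock_gettime( CLOCK_MONOTONIC, &t1 ) ;

   double seconds = ( t1.tv_sec - t0.tv_sec ) + ( t1.tv_nsec - t0.tv_nsec ) * 1e-9 ;
   printf( "%.2f s\n", seconds ) ;

   return 0 ;
}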
Even when the branch is completely redundant (shift == 65), I don't see much difference in how long the operation takes, but skipping the SSE left-shift instructions clearly helps when (shift == 64).