Cost of an if check versus the operation itself?

Here are two different ways I could implement a left shift by >= 64 bits using inline SSE intrinsics. The second variation handles the (shift == 64) case explicitly and avoids one SSE instruction, but adds the cost of the if check:

inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
   __m128i r ;

   r = _mm_slli_si128( a, 8 ) ; // a << 64

   r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ; // then << the remaining (shift - 64) bits

   return r ;
}

inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
   __m128i r ;

   r = _mm_slli_si128( a, 8 ) ; // a << 64

   if ( shift > 64 ) // when shift == 64 the remaining shift would be by zero bits, so skip it
   {
      r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   }

   return r ;
}
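For context, a minimal hypothetical caller might look like the following (assuming one of the two variants above is compiled in; the test value and the way the result is inspected are only for illustration):

#include <emmintrin.h> // SSE2 intrinsics
#include <stdio.h>

int main()
{
   __m128i a = _mm_set_epi64x( 0, 1 ) ;        // low qword = 1, high qword = 0

   __m128i r = shiftLeftGte64ByBits( a, 65 ) ; // the 1 should land at bit 65 (bit 1 of the high qword)

   unsigned long long lo = (unsigned long long)_mm_cvtsi128_si64( r ) ;
   unsigned long long hi = (unsigned long long)_mm_cvtsi128_si64( _mm_srli_si128( r, 8 ) ) ;

   printf( "hi=%llx lo=%llx\n", hi, lo ) ;     // expect hi=2 lo=0

   return 0 ;
}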


I was wondering (roughly) how the cost of this if () check compares to the cost of the shift instruction itself (perhaps relative to the time or number of cycles required for a normal left-shift ALU instruction).



1 answer


I ran a micro-benchmark using code like this:

void timingWithIf( volatile __m128i * pA, volatile unsigned long * pShift, unsigned long n )
{
   __m128i r = *pA ;

   for ( unsigned long i = 0 ; i < n ; i++ )
   {
      r = _mm_slli_si128( r, 8 ) ; // a << 64

      unsigned long shift = *pShift ;

      // does it hurt more to do the check, or just do the operation?
      if ( shift > 64 )
      {
         r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
      }
   }

   *pA = r ;
}
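The timing harness isn't shown above; a minimal driver along these lines would do (the use of clock_gettime, the iteration count, and the shift value are assumptions for illustration, not necessarily the exact harness used):

#include <emmintrin.h>
#include <time.h>
#include <stdio.h>

int main()
{
   volatile __m128i a = _mm_set_epi64x( 0, 1 ) ;
   volatile unsigned long shift = 65 ;      // or 64 to exercise the not-taken branch
   const unsigned long n = 1000000000UL ;   // 10^9 iterations, matching the timings below

   struct timespec t0, t1 ;
   clock_gettime( CLOCK_MONOTONIC, &t0 ) ;

   timingWithIf( &a, &shift, n ) ;

   clock_gettime( CLOCK_MONOTONIC, &t1 ) ;

   double elapsed = ( t1.tv_sec - t0.tv_sec ) + ( t1.tv_nsec - t0.tv_nsec ) * 1e-9 ;
   printf( "%.2f s\n", elapsed ) ;

   return 0 ;
}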


The following code is generated:

    xor    %eax,%eax
    movdqa (%rdi),%xmm0
    test   %rdx,%rdx
    movdqa %xmm0,0xffffffffffffffe8(%rsp)
    jbe    F0
    pxor   %xmm0,%xmm0
B0: movdqa 0xffffffffffffffe8(%rsp),%xmm2
    pslldq $0x8,%xmm2
    movdqa %xmm2,0xffffffffffffffe8(%rsp)
    mov    (%rsi),%rcx
    cmp    $0x40,%rcx
    jbe    F1
    add    $0xffffffffffffffc0,%rcx
    movd   %ecx,%xmm1
    punpckldq %xmm0,%xmm1
    punpcklqdq %xmm0,%xmm1
    psllq  %xmm1,%xmm2
    movdqa %xmm2,0xffffffffffffffe8(%rsp)
F1: inc    %rax
    cmp    %rdx,%rax
    jb     B0
F0: movdqa 0xffffffffffffffe8(%rsp),%xmm0
    movdqa %xmm0,(%rdi)
    retq
    nopl   0x0(%rax)


Note that the shift that the branch avoids actually takes three SSE instructions (four if you count the ALU -> XMM register move), plus one ALU add operation:

    add    $0xffffffffffffffc0,%rcx
    movd   %ecx,%xmm1
    punpckldq %xmm0,%xmm1
    punpcklqdq %xmm0,%xmm1
    psllq  %xmm1,%xmm2


With 1 billion loop iterations, I measured:



1) shift == 64:

~2.5 s with the if (avoiding the no-op shift).

~2.8 s performing the no-op shift.

2) shift == 65:

~2.8 s with or without the if.

Timing was done on an "Intel(R) Xeon(R) CPU X5570 @ 2.93GHz" (per /proc/cpuinfo) and was relatively consistent across runs.
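Roughly, ignoring loop overhead and any frequency scaling, those numbers work out to:

    2.5 s / 10^9 iterations ≈ 2.5 ns ≈ 7.3 cycles per iteration at 2.93 GHz
    2.8 s / 10^9 iterations ≈ 2.8 ns ≈ 8.2 cycles per iteration at 2.93 GHz

so skipping the extra shift sequence saves on the order of one cycle per iteration.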

Even when the branch is completely redundant (shift == 65, so the shift is performed anyway), I don't see a measurable difference in the time the operation takes, but the check clearly pays off by avoiding the extra SSE shift instructions when (shift == 64).
