SIMD linear search is slower than unrolled loop
I am working on an application where an optimized linear search will make a big difference in overall performance, and I have been tasked with improving it as much as I can.
I am searching a vector of 10,000 items that is terminated with sentinel values. The search starts at a constant distance before the target item, runs linearly, and I measure the time it takes to find the item. The target elements are picked at random from the set of elements that are at least that constant distance from the beginning of the array, so the search always has a valid starting index. I am measuring performance with Google Benchmark.
The results I collected surprised me. I expected SIMD to overtake the unrolled loop at some point, but the gap between the two seems to grow as the search distance grows. I also don't understand why a loop unrolled 8 times is faster at short distances than a loop unrolled 32 times.
Benchmark                             Time          CPU   Iterations
--------------------------------------------------------------------
BM_Search<linUnroll<8>>/2            86 ns        86 ns      7699241
BM_Search<linUnroll<8>>/4           103 ns       103 ns      6797378
BM_Search<linUnroll<8>>/16          650 ns       650 ns      1079095
BM_Search<linUnroll<8>>/64         1365 ns      1365 ns       514196
BM_Search<linUnroll<8>>/256        3558 ns      3558 ns       196519
BM_Search<linUnroll<8>>/1024      12358 ns     12358 ns        56635
BM_Search<linUnroll<8>>/4096      47341 ns     47341 ns        14780
BM_Search<linUnroll<8>>/8192      95029 ns     95030 ns         7367
BM_Search<linUnroll<32>>/2          131 ns       131 ns      5337221
BM_Search<linUnroll<32>>/4          131 ns       131 ns      5329296
BM_Search<linUnroll<32>>/16         291 ns       291 ns      2404646
BM_Search<linUnroll<32>>/64         836 ns       836 ns       831093
BM_Search<linUnroll<32>>/256       2776 ns      2776 ns       252901
BM_Search<linUnroll<32>>/1024     10962 ns     10962 ns        63828
BM_Search<linUnroll<32>>/4096     41312 ns     41312 ns        16941
BM_Search<linUnroll<32>>/8192     83303 ns     83304 ns         8401
BM_Search<linSIMD>/2                163 ns       163 ns      4304086
BM_Search<linSIMD>/4                208 ns       208 ns      3354716
BM_Search<linSIMD>/16               366 ns       366 ns      1912122
BM_Search<linSIMD>/64               871 ns       871 ns       803854
BM_Search<linSIMD>/256             3333 ns      3334 ns       210159
BM_Search<linSIMD>/1024           11262 ns     11262 ns        62157
BM_Search<linSIMD>/4096           42656 ns     42656 ns        16413
BM_Search<linSIMD>/8192           87824 ns     87824 ns         7970
I am on an i5-4570 and I compiled with clang 5.0.0. Quick-bench has no AVX and its clang is only 3.8, but the code should still be runnable there. I also tried unrolling the SIMD loop further and moving to 256-bit AVX instructions, but both made things worse. Why is the unrolled loop so much faster, and why does the heavily unrolled loop do so much worse at short distances than the lightly unrolled one?
The classic diagnosis for underperforming SIMD is that you are not doing enough work per SIMD operation, but I think I am doing enough here.
#include <vector>
#include <cinttypes>
#include <immintrin.h>
typedef int V;
typedef std::vector<V> vi;

long linSIMD(const vi& arr, const long guessIx, const V x) {
  using v4 = V __attribute__ ((vector_size (4*4)));         // 4 x 32-bit int lanes
  using dv2 = int64_t __attribute__ ((vector_size (4*4)));  // the same 16 bytes viewed as 2 x int64
  constexpr int roll = 4;
  constexpr union {
    int32_t i32[2];
    int64_t i64;
  } skip = {-2,-2};                             // the pattern produced when all 4 lanes compare true
  v4 xVec = {x,x,x,x};
  for (int i = guessIx;; i += roll) {
    v4 arrVec;
    for (long j = 0; j < 4; j++) arrVec[j] = arr[i+j];
    union {
      v4 i32;
      dv2 i64;
    } cmpVec = {arrVec < xVec};                 // -1 in each lane where arr[i+j] < x
    // add the reversed vector, so lane k holds cmp[k] + cmp[3-k]
    v4 cmpVec2 = {cmpVec.i32[3], cmpVec.i32[2], cmpVec.i32[1], cmpVec.i32[0]};
    cmpVec.i32 += cmpVec2;
    if (cmpVec.i64[0] == skip.i64) continue;    // all 4 lanes still below x, keep searching
    return i - cmpVec.i32[0] - cmpVec.i32[1];   // i + number of lanes below x = first index >= x
  }
}

long linUnroll32(const vi& arr, const long guessIx, const V x) {
  constexpr int roll = 32;
  for (long i = guessIx;; i += roll)
    for (long j = 0; j < roll; j++)
      if (arr[i+j] >= x) return i+j;
}
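For reference, the result names above imply a templated linUnroll and a BM_Search Google Benchmark harness that are not shown. A minimal sketch of what they might look like follows; the array contents, the sentinel handling, and the choice of start index are assumptions, not the code that was actually benchmarked.
#include <benchmark/benchmark.h>
#include <vector>

template <int roll>
long linUnroll(const std::vector<int>& arr, const long guessIx, const int x) {
  for (long i = guessIx;; i += roll)
    for (long j = 0; j < roll; j++)
      if (arr[i + j] >= x) return i + j;
}

template <long (*search)(const std::vector<int>&, long, int)>
void BM_Search(benchmark::State& state) {
  const long dist = state.range(0);              // the constant search distance
  std::vector<int> arr(10000);
  for (int i = 0; i < 10000; i++) arr[i] = i;    // sorted data; the real setup also appends sentinels
  const long guessIx = 0;                        // the real setup picks random valid start indices
  for (auto _ : state)
    benchmark::DoNotOptimize(search(arr, guessIx, arr[guessIx + dist]));
}
BENCHMARK_TEMPLATE(BM_Search, linUnroll<8>)
  ->Arg(2)->Arg(4)->Arg(16)->Arg(64)->Arg(256)->Arg(1024)->Arg(4096)->Arg(8192);
// ...and likewise for linUnroll<32> and linSIMD.
BENCHMARK_MAIN();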
http://quick-bench.com/_x_v_WXLWtwvvLsObNlIxjXxS_g https://godbolt.org/g/Wyx2pS
The best I could do (see the quick-bench results) was:
int linSIMD4(const vi& arr, const int guessIx, const int x) {
  auto vecX = _mm_set1_epi32(x - 1);              // arr[j] > x-1  <=>  arr[j] >= x
  const int *ptr = arr.data();
  int i = guessIx;
  // unaligned start
  int misalignment = (uintptr_t)(ptr + i) & 15;
  auto arrVec = _mm_loadu_si128((__m128i*)(ptr + i));
  auto cmp = _mm_cmpgt_epi32(arrVec, vecX);
  int mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));
  if (mask)
    return i + __builtin_ctz(mask);
  // continue with aligned part
  i += (16 - misalignment) / 4;
  for (; ; i += 16) {
    auto av0 = _mm_load_si128((__m128i*)(ptr + i));
    auto av1 = _mm_load_si128((__m128i*)(ptr + i + 4));
    auto av2 = _mm_load_si128((__m128i*)(ptr + i + 8));
    auto av3 = _mm_load_si128((__m128i*)(ptr + i + 12));
    auto cmp0 = _mm_cmpgt_epi32(av0, vecX);
    auto cmp1 = _mm_cmpgt_epi32(av1, vecX);
    auto cmp2 = _mm_cmpgt_epi32(av2, vecX);
    auto cmp3 = _mm_cmpgt_epi32(av3, vecX);
    // narrow the 16 dword results down to 16 bytes so one movemask covers them all
    auto cmp = _mm_packs_epi16(_mm_packs_epi32(cmp0, cmp1), _mm_packs_epi32(cmp2, cmp3));
    int mask = _mm_movemask_epi8(cmp);
    if (mask)
      return i + __builtin_ctz(mask);
  }
}
This is basically what geza described, but I added a special first iteration to get the data aligned for the main loop. Loads that cross a cache-line boundary (or a page boundary) are slower, and this gets rid of them. The extra overhead costs little at short distances; on the other hand, it cannot win anything for very short searches (fewer than 4 elements), which are handled entirely by the first, unaligned vector.
I also tried reversing the condition (linSIMD5), using (a >= b) = !(b > a) together with AVX's non-destructive three-operand encoding, which would allow the array load to be folded into the vpcmpgtd (reducing µops in the fused domain), but quick-bench doesn't do AVX, so just ignore that result and try it yourself.
There's an AVX2 version at the bottom; I haven't tried or benchmarked it. It doesn't use the load/compare folding trick (which may or may not help), but it's easy to adapt.
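The AVX2 version itself is not included in this excerpt either; as a rough sketch of what a 256-bit variant of the main loop might look like (an assumption, not the code the answer refers to):
#include <immintrin.h>
#include <vector>

// 8 ints per compare; movemask collects the 8 sign bits of the dword results.
long linSIMD_avx2(const std::vector<int>& arr, const long guessIx, const int x) {
  auto vecX = _mm256_set1_epi32(x - 1);                // arr[i+j] > x-1  <=>  arr[i+j] >= x
  const int *ptr = arr.data();
  for (long i = guessIx;; i += 8) {
    auto av = _mm256_loadu_si256((const __m256i*)(ptr + i));
    auto cmp = _mm256_cmpgt_epi32(av, vecX);
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp));
    if (mask)
      return i + __builtin_ctz(mask);
  }
}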
Do a larger batch of work per iteration of the SIMD loop.
For example, compare 4 SIMD registers' worth of data, then merge the 16 comparison results into a single SIMD register. Branch on that (and break out of the loop if a match is found). This way you will have:
- fewer branches
- more opportunity for the compiler and the CPU to parallelize the work
After you break out of the loop, you need to find the index of the match among the 16 candidate elements. You can do this with SIMD or with whatever method you prefer (a sketch of one possible realization follows below).
This approach should be faster than your current implementation (for long search distances).
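A minimal sketch of one possible realization under assumed names (an illustration, not this answer's code), using an OR-merge of the four compare results and a scalar re-scan of the 16-element window after the single branch:
#include <immintrin.h>
#include <vector>

long linSIMD_batched(const std::vector<int>& arr, const long guessIx, const int x) {
  auto vecX = _mm_set1_epi32(x - 1);                   // arr[i+j] > x-1  <=>  arr[i+j] >= x
  const int *ptr = arr.data();
  for (long i = guessIx;; i += 16) {
    auto c0 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i)), vecX);
    auto c1 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i + 4)), vecX);
    auto c2 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i + 8)), vecX);
    auto c3 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i + 12)), vecX);
    // merge the 16 comparison results into one register: a single branch per 16 elements
    auto any = _mm_or_si128(_mm_or_si128(c0, c1), _mm_or_si128(c2, c3));
    if (_mm_movemask_epi8(any)) {
      for (long j = 0; j < 16; j++)                    // locate the match inside the hit window
        if (ptr[i + j] >= x) return i + j;
    }
  }
}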