SIMD linear search is slower than unrolled loop
I am working on an application where an optimized linear search will make a big difference in overall performance, and I have been tasked with improving it as much as I can.
I am searching a vector of 10,000 items that is terminated with sentinel values. The search starts at a constant distance before the target item, runs linearly, and I measure the time it takes to find the item. The target elements are picked at random from the set of elements that are at least that constant distance from the beginning of the array, so the search always has a valid starting index. I am measuring performance with Google Benchmark.
The results I collected surprised me. I expected SIMD to overtake the unrolled loop at some point, but the gap between the two seems to grow as the search distance grows. I also don't understand why a loop unrolled 8 times is faster at short distances than a loop unrolled 32 times.
Benchmark                             Time          CPU   Iterations
--------------------------------------------------------------------
BM_Search<linUnroll<8>>/2            86 ns        86 ns      7699241
BM_Search<linUnroll<8>>/4           103 ns       103 ns      6797378
BM_Search<linUnroll<8>>/16          650 ns       650 ns      1079095
BM_Search<linUnroll<8>>/64         1365 ns      1365 ns       514196
BM_Search<linUnroll<8>>/256        3558 ns      3558 ns       196519
BM_Search<linUnroll<8>>/1024      12358 ns     12358 ns        56635
BM_Search<linUnroll<8>>/4096      47341 ns     47341 ns        14780
BM_Search<linUnroll<8>>/8192      95029 ns     95030 ns         7367
BM_Search<linUnroll<32>>/2          131 ns       131 ns      5337221
BM_Search<linUnroll<32>>/4          131 ns       131 ns      5329296
BM_Search<linUnroll<32>>/16         291 ns       291 ns      2404646
BM_Search<linUnroll<32>>/64         836 ns       836 ns       831093
BM_Search<linUnroll<32>>/256       2776 ns      2776 ns       252901
BM_Search<linUnroll<32>>/1024     10962 ns     10962 ns        63828
BM_Search<linUnroll<32>>/4096     41312 ns     41312 ns        16941
BM_Search<linUnroll<32>>/8192     83303 ns     83304 ns         8401
BM_Search<linSIMD>/2                163 ns       163 ns      4304086
BM_Search<linSIMD>/4                208 ns       208 ns      3354716
BM_Search<linSIMD>/16               366 ns       366 ns      1912122
BM_Search<linSIMD>/64               871 ns       871 ns       803854
BM_Search<linSIMD>/256             3333 ns      3334 ns       210159
BM_Search<linSIMD>/1024           11262 ns     11262 ns        62157
BM_Search<linSIMD>/4096           42656 ns     42656 ns        16413
BM_Search<linSIMD>/8192           87824 ns     87824 ns         7970
I am on an i5-4570 and I compiled with clang 5.0.0. Quick-bench has no AVX and its clang is only 3.8, but the code should still be runnable there. I also tried unrolling the SIMD loop further and moving to 256-bit AVX instructions, but both made things worse. Why is the unrolled loop so much faster, and why does the heavily unrolled loop do so much worse at short distances than the lightly unrolled one?
The classic diagnosis for underperforming SIMD is that you are not doing enough work per SIMD operation, but I think I am doing enough here.
#include <vector>
#include <cinttypes>
#include <immintrin.h>
typedef int V;
typedef std::vector<V> vi;

long linSIMD(const vi& arr, const long guessIx, const V x) {
  using v4 = V __attribute__ ((vector_size (4*4)));         // 4 x 32-bit int lanes
  using dv2 = int64_t __attribute__ ((vector_size (4*4)));  // the same 16 bytes viewed as 2 x int64
  constexpr int roll = 4;
  constexpr union {
    int32_t i32[2];
    int64_t i64;
  } skip = {-2,-2};                             // the pattern produced when all 4 lanes compare true
  v4 xVec = {x,x,x,x};
  for (int i = guessIx;; i += roll) {
    v4 arrVec;
    for (long j = 0; j < 4; j++) arrVec[j] = arr[i+j];
    union {
      v4 i32;
      dv2 i64;
    } cmpVec = {arrVec < xVec};                 // -1 in each lane where arr[i+j] < x
    // add the reversed vector, so lane k holds cmp[k] + cmp[3-k]
    v4 cmpVec2 = {cmpVec.i32[3], cmpVec.i32[2], cmpVec.i32[1], cmpVec.i32[0]};
    cmpVec.i32 += cmpVec2;
    if (cmpVec.i64[0] == skip.i64) continue;    // all 4 lanes still below x, keep searching
    return i - cmpVec.i32[0] - cmpVec.i32[1];   // i + number of lanes below x = first index >= x
  }
}

long linUnroll32(const vi& arr, const long guessIx, const V x) {
  constexpr int roll = 32;
  for (long i = guessIx;; i += roll)
    for (long j = 0; j < roll; j++)
      if (arr[i+j] >= x) return i+j;
}
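For reference, the result names above imply a templated linUnroll and a BM_Search Google Benchmark harness that are not shown. A minimal sketch of what they might look like follows; the array contents, the sentinel handling, and the choice of start index are assumptions, not the code that was actually benchmarked.
#include <benchmark/benchmark.h>
#include <vector>

template <int roll>
long linUnroll(const std::vector<int>& arr, const long guessIx, const int x) {
  for (long i = guessIx;; i += roll)
    for (long j = 0; j < roll; j++)
      if (arr[i + j] >= x) return i + j;
}

template <long (*search)(const std::vector<int>&, long, int)>
void BM_Search(benchmark::State& state) {
  const long dist = state.range(0);              // the constant search distance
  std::vector<int> arr(10000);
  for (int i = 0; i < 10000; i++) arr[i] = i;    // sorted data; the real setup also appends sentinels
  const long guessIx = 0;                        // the real setup picks random valid start indices
  for (auto _ : state)
    benchmark::DoNotOptimize(search(arr, guessIx, arr[guessIx + dist]));
}
BENCHMARK_TEMPLATE(BM_Search, linUnroll<8>)
  ->Arg(2)->Arg(4)->Arg(16)->Arg(64)->Arg(256)->Arg(1024)->Arg(4096)->Arg(8192);
// ...and likewise for linUnroll<32> and linSIMD.
BENCHMARK_MAIN();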
http://quick-bench.com/_x_v_WXLWtwvvLsObNlIxjXxS_g https://godbolt.org/g/Wyx2pS
The best I could do (see the quick-bench results) was:
int linSIMD4(const vi& arr, const int guessIx, const int x) {
  auto vecX = _mm_set1_epi32(x - 1);              // arr[j] > x-1  <=>  arr[j] >= x
  const int *ptr = arr.data();
  int i = guessIx;
  // unaligned start
  int misalignment = (uintptr_t)(ptr + i) & 15;
  auto arrVec = _mm_loadu_si128((__m128i*)(ptr + i));
  auto cmp = _mm_cmpgt_epi32(arrVec, vecX);
  int mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));
  if (mask)
    return i + __builtin_ctz(mask);
  // continue with aligned part
  i += (16 - misalignment) / 4;
  for (; ; i += 16) {
    auto av0 = _mm_load_si128((__m128i*)(ptr + i));
    auto av1 = _mm_load_si128((__m128i*)(ptr + i + 4));
    auto av2 = _mm_load_si128((__m128i*)(ptr + i + 8));
    auto av3 = _mm_load_si128((__m128i*)(ptr + i + 12));
    auto cmp0 = _mm_cmpgt_epi32(av0, vecX);
    auto cmp1 = _mm_cmpgt_epi32(av1, vecX);
    auto cmp2 = _mm_cmpgt_epi32(av2, vecX);
    auto cmp3 = _mm_cmpgt_epi32(av3, vecX);
    // narrow the 16 dword results down to 16 bytes so one movemask covers them all
    auto cmp = _mm_packs_epi16(_mm_packs_epi32(cmp0, cmp1), _mm_packs_epi32(cmp2, cmp3));
    int mask = _mm_movemask_epi8(cmp);
    if (mask)
      return i + __builtin_ctz(mask);
  }
}
This is basically what geza described, but I added a special first iteration to get the data aligned for the main loop. Loads that cross a cache-line boundary (or a page boundary) are slower, and this gets rid of them. The extra overhead costs little at short distances; on the other hand, it cannot win anything for very short searches (fewer than 4 elements), which are handled entirely by the first, unaligned vector.
I also tried reversing the condition (linSIMD5), using (a >= b) = !(b > a) together with AVX's non-destructive three-operand encoding, which would allow the array load to be folded into the vpcmpgtd (reducing µops in the fused domain), but quick-bench doesn't do AVX, so just ignore that result and try it yourself.
There's an AVX2 version at the bottom; I haven't tried or benchmarked it. It doesn't use the load/compare folding trick (which may or may not help), but it's easy to adapt.
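The AVX2 version itself is not included in this excerpt either; as a rough sketch of what a 256-bit variant of the main loop might look like (an assumption, not the code the answer refers to):
#include <immintrin.h>
#include <vector>

// 8 ints per compare; movemask collects the 8 sign bits of the dword results.
long linSIMD_avx2(const std::vector<int>& arr, const long guessIx, const int x) {
  auto vecX = _mm256_set1_epi32(x - 1);                // arr[i+j] > x-1  <=>  arr[i+j] >= x
  const int *ptr = arr.data();
  for (long i = guessIx;; i += 8) {
    auto av = _mm256_loadu_si256((const __m256i*)(ptr + i));
    auto cmp = _mm256_cmpgt_epi32(av, vecX);
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp));
    if (mask)
      return i + __builtin_ctz(mask);
  }
}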
Do a larger batch of work per iteration of the SIMD loop.
For example, compare 4 SIMD registers' worth of data, then merge the 16 comparison results into a single SIMD register. Branch on that (and break out of the loop if a match is found). This way you will have:
- fewer branches
- more opportunity for the compiler and the CPU to parallelize the work
After you break out of the loop, you need to find the index of the match among the 16 candidate elements. You can do this with SIMD or with whatever method you prefer (a sketch of one possible realization follows below).
This approach should be faster than your current implementation (for long search distances).
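A minimal sketch of one possible realization under assumed names (an illustration, not this answer's code), using an OR-merge of the four compare results and a scalar re-scan of the 16-element window after the single branch:
#include <immintrin.h>
#include <vector>

long linSIMD_batched(const std::vector<int>& arr, const long guessIx, const int x) {
  auto vecX = _mm_set1_epi32(x - 1);                   // arr[i+j] > x-1  <=>  arr[i+j] >= x
  const int *ptr = arr.data();
  for (long i = guessIx;; i += 16) {
    auto c0 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i)), vecX);
    auto c1 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i + 4)), vecX);
    auto c2 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i + 8)), vecX);
    auto c3 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i*)(ptr + i + 12)), vecX);
    // merge the 16 comparison results into one register: a single branch per 16 elements
    auto any = _mm_or_si128(_mm_or_si128(c0, c1), _mm_or_si128(c2, c3));
    if (_mm_movemask_epi8(any)) {
      for (long j = 0; j < 16; j++)                    // locate the match inside the hit window
        if (ptr[i + j] >= x) return i + j;
    }
  }
}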