Compare two 16 byte values ββfor equality using SSE 4.2?
I have a structure like this:
struct {
uint32_t a;
uint16_t b;
uint16_t c;
uint16_t d;
uint8_t e;
} s;
and I would like to compare two of the above structures as quickly as possible for equality. I looked at Intel Intrinsics Guide but couldn't find a comparison for integers. The options available were mostly dual and unidirectional vector inputs.
Can someone please advise on a better approach? I can add a union to my structure to make it easier to handle.
I am limited (for now) to using SSE4.2, but any AVX answers are also welcome if significantly faster. I am using GCC 4.8.2
source to share
What @ zx485 should have written:
.data
mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
pxor xmm1, xmm2 ; equiv to psubb, but runs on all 3 vector execution ports
ptest xmm1, xmmword ptr [mask11byte] ; SSE 4.1
setz al ; AL=TRUE for equal
As long as nothing bad happens (floating point exceptions), you don't need to mask your operands before computation, even if they contain garbage. And since it PTEST
does bitwise AND as part of its job, you don't need a separate one PAND
at all.
For a while, I thought I had a version that could use less space and less uops, but I had to add an extra instruction because there isn't pcmpneq
(so I need a boolean not
). Thus, it is smaller, the same number of uops, but significantly worse latency.
.code
PCMPEQB xmm1, xmm2 ; bytes of xmm1 = 0xFF on equal
PMOVMSKB eax, xmm1 ; ax = high bit of each byte of xmm1
NOT eax
TEST eax, 0x7FF ; zero flag set if all the low 11 bits are zero
SETZ al ; 17 bytes
; Or one fewer insn with BMI1 ANDN. One fewer uop if test can't macro-fuse
ANDN eax, eax, [mask11bits] ; only test the low 11 bits.
; ANDN version takes 20 bytes, plus 2B of data
.data
mask11bits dw 07ffh
test
can use a macro fuse with jcc
, so if you use that as a jump condition instead of actually executing setz
, you step out ahead in size. (since you don't need a 16B permanent mask.)
PTEST
takes 2 uops, so version PTEST
is 4 uops total (including jcc
or other instruction). The version pmovmskb
also has 4 uops with the macroblock of the test
/ branch jcc
, but 5 with cmovcc
/ setcc
. (4s andn
, with any of setcc
/ cmovcc
/ jcc
, since it cannot use the macro fuse.)
(Agner Fog table says it PTEST
accepts 1 fusible uop domain on Sandybridge, 2 on all other Intel processors that support it. I'm not sure what I believe, however.)
Latency on Hasuela (important if the branch doesn't predict well):
-
pxor
: 1 +PTEST
: 2 = 3 cycles -
pcmpeqb
: 1 +pmovmskb
: 3 +not
: 1 +test
: 1 = 6 cycles -
pcmpeqb
: 1 +pmovmskb
: 3 +andn
: 1 = 5 cycles (but not macro configured, maybe 1 more latency cycle?)
Thus, the version PTEST
has significantly lower latency: it jcc
can execute earlier to detect branch false predictions earlier.
Agner Fog tests show it PTEST
has latency = 3 on Nehalem, 1 on SnB / IvB, 2 on Haswell.
source to share
A simple solution is to simply subtract from the two structural bytes after masking so that you only get all zero if all packed bytes are identical. This code is in MASM format, but you can probably adapt it to AT&T gcc syntax, or essentially:
.data
mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
pand xmm1, xmmword ptr [mask11byte]
pand xmm2, xmmword ptr [mask11byte]
psubb xmm1, xmm2
ptest xmm1, xmm1 ; SSE 4.1
setz al ; AL=TRUE for equal
Addition: Since the structure size is 11 bytes, 256bit / 32 bytes-AVX (x) is meaningless.
source to share