Compare two 16 byte values ​​for equality using SSE 4.2?

I have a structure like this:

struct {
    uint32_t a;
    uint16_t b;
    uint16_t c;
    uint16_t d;
    uint8_t  e;
} s;

      

and I would like to compare two of the above structures as quickly as possible for equality. I looked at Intel Intrinsics Guide but couldn't find a comparison for integers. The options available were mostly dual and unidirectional vector inputs.

Can someone please advise on a better approach? I can add a union to my structure to make it easier to handle.

I am limited (for now) to using SSE4.2, but any AVX answers are also welcome if significantly faster. I am using GCC 4.8.2

+3


source to share


2 answers


What @ zx485 should have written:

.data
  mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
  pxor xmm1, xmm2  ; equiv to psubb, but runs on all 3 vector execution ports
  ptest xmm1, xmmword ptr [mask11byte]   ; SSE 4.1
  setz al     ; AL=TRUE for equal

      

As long as nothing bad happens (floating point exceptions), you don't need to mask your operands before computation, even if they contain garbage. And since it PTEST

does bitwise AND as part of its job, you don't need a separate one PAND

at all.

For a while, I thought I had a version that could use less space and less uops, but I had to add an extra instruction because there isn't pcmpneq

(so I need a boolean not

). Thus, it is smaller, the same number of uops, but significantly worse latency.

.code
  PCMPEQB xmm1, xmm2  ; bytes of xmm1 = 0xFF on equal
  PMOVMSKB eax, xmm1  ; ax = high bit of each byte of xmm1
  NOT eax
  TEST eax, 0x7FF  ; zero flag set if all the low 11 bits are zero
  SETZ al    ; 17 bytes

; Or one fewer insn with BMI1 ANDN.  One fewer uop if test can't macro-fuse
  ANDN eax, eax, [mask11bits]   ; only test the low 11 bits.
;  ANDN version takes 20 bytes, plus 2B of data
.data
  mask11bits dw 07ffh

      

test

can use a macro fuse with jcc

, so if you use that as a jump condition instead of actually executing setz

, you step out ahead in size. (since you don't need a 16B permanent mask.)



PTEST

takes 2 uops, so version PTEST

is 4 uops total (including jcc

or other instruction). The version pmovmskb

also has 4 uops with the macroblock of the test

/ branch jcc

, but 5 with cmovcc

/ setcc

. (4s andn

, with any of setcc

/ cmovcc

/ jcc

, since it cannot use the macro fuse.)

(Agner Fog table says it PTEST

accepts 1 fusible uop domain on Sandybridge, 2 on all other Intel processors that support it. I'm not sure what I believe, however.)

Latency on Hasuela (important if the branch doesn't predict well):

  • pxor

    : 1 + PTEST

    : 2 = 3 cycles
  • pcmpeqb

    : 1 + pmovmskb

    : 3 + not

    : 1 + test

    : 1 = 6 cycles
  • pcmpeqb

    : 1 + pmovmskb

    : 3 + andn

    : 1 = 5 cycles (but not macro configured, maybe 1 more latency cycle?)

Thus, the version PTEST

has significantly lower latency: it jcc

can execute earlier to detect branch false predictions earlier.

Agner Fog tests show it PTEST

has latency = 3 on Nehalem, 1 on SnB / IvB, 2 on Haswell.

+2


source


A simple solution is to simply subtract from the two structural bytes after masking so that you only get all zero if all packed bytes are identical. This code is in MASM format, but you can probably adapt it to AT&T gcc syntax, or essentially:

.data
  mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
  pand  xmm1, xmmword ptr [mask11byte]
  pand  xmm2, xmmword ptr [mask11byte]
  psubb xmm1, xmm2
  ptest xmm1, xmm1   ; SSE 4.1
  setz al     ; AL=TRUE for equal

      



Addition: Since the structure size is 11 bytes, 256bit / 32 bytes-AVX (x) is meaningless.

0


source







All Articles