Why is TZCNT running on my Sandy Bridge processor?

Question

I am running a Core i7-3930K, which has the Sandy Bridge microarchitecture. When I ran the following code (compiled with MSVC 19, VS2015), the results surprised me (see comments):

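#include <intrin.h>   // __cpuidex and _tzcnt_u64 (MSVC)
#include <bitset>
#include <cstdint>
#include <iostream>
using namespace std;

int wmain(int argc, wchar_t* argv[])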
{
    uint64_t r = 0b1110'0000'0000'0000ULL;
    uint64_t tzcnt = _tzcnt_u64(r);
    cout << tzcnt << endl; // prints 13

    int info[4]{};
    __cpuidex(info, 7, 0);
    int ebx = info[1];
    cout << bitset<32>(ebx) << endl; // prints 32 zeros (including the bmi1 bit)

    return 0;
}

The disassembly shows that the tzcnt instruction is indeed emitted from the intrinsic:

    uint64_t r = 0b1110'0000'0000'0000ULL;
00007FF64B44877F 48 C7 45 08 00 E0 00 00 mov         qword ptr [r],0E000h  
    uint64_t tzcnt = _tzcnt_u64(r);
00007FF64B448787 F3 48 0F BC 45 08    tzcnt       rax,qword ptr [r]  
00007FF64B44878D 48 89 45 28          mov         qword ptr [tzcnt],rax

Why am I not getting an invalid opcode exception (#UD)? The instruction works correctly, yet the CPU reports that it does not support it.

Could it be some weird microcode revision that contains an implementation of the instruction but doesn't report support for it (and for the others included in bmi1)?

I haven't checked the rest of the bmi1 instructions, but I'm wondering how common this phenomenon is.

assembly x86 x86-64

Doommuffins May 09 '17 at 21:34

1 answer

Johan · Accepted Answer · 2017-08-10T13:46:28+0000

The reason Sandy Bridge (and earlier) processors seem to support lzcnt and tzcnt is that both instructions have a backward-compatible encoding.

lzcnt eax,eax  = rep bsr eax,eax
tzcnt eax,eax  = rep bsf eax,eax

On older processors, the rep prefix is simply ignored.
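To make the encoding relationship concrete, here is a minimal C++ sketch that compares the machine-code bytes of the register-to-register forms. The helper name drop_rep_prefix and the byte-array names are made up for illustration; the helper only mimics, at the byte level, what a pre-BMI1 decoder effectively does with the F3 (rep) prefix on these opcodes.

```cpp
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Encodings of the "eax,eax" forms (ModRM byte C0).
const Bytes tzcnt_eax_eax = {0xF3, 0x0F, 0xBC, 0xC0};
const Bytes bsf_eax_eax   = {0x0F, 0xBC, 0xC0};
const Bytes lzcnt_eax_eax = {0xF3, 0x0F, 0xBD, 0xC0};
const Bytes bsr_eax_eax   = {0x0F, 0xBD, 0xC0};

// Illustrative helper (hypothetical name): strip a leading F3 prefix byte,
// leaving the instruction an older CPU actually executes.
Bytes drop_rep_prefix(const Bytes& insn) {
    if (!insn.empty() && insn[0] == 0xF3) {
        return Bytes(insn.begin() + 1, insn.end());
    }
    return insn;
}
```

With the prefix removed, the tzcnt bytes are exactly the bsf bytes, and the lzcnt bytes are exactly the bsr bytes, which is why no #UD is raised.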

So much for the good news.
The bad news is that the semantics of the two versions differ.

lzcnt eax,zero => eax = 32, CF=1, ZF=0  
bsr eax,zero   => eax = undefined, ZF=1
lzcnt eax,0xFFFFFFFF => eax=0, CF=0, ZF=1   //dest=number of msb leading zeros
bsr eax,0xFFFFFFFF => eax=31, ZF=0        //dest = bit index of highest set bit


tzcnt eax,zero => eax = 32, CF=1, ZF=0
bsf eax,zero   => eax = undefined, ZF=1
tzcnt eax,0xFFFFFFFF => eax=0, CF=0, ZF=1   //dest=number of lsb trailing zeros
bsf eax,0xFFFFFFFF => eax=0, ZF=0        //dest = bit index of lowest set bit
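The table above can be restated as a small software model in C++. This is only a sketch of the documented 32-bit behavior, not code that executes the real instructions; all type and function names (ScanResult, CountResult, model_bsf and so on) are invented for illustration, and the "undefined" destination is modeled with a dest_defined flag.

```cpp
#include <cassert>
#include <cstdint>

// Result of bsf/bsr: dest is only meaningful when dest_defined is true.
struct ScanResult { bool dest_defined; std::uint32_t dest; bool zf; };

// Result of tzcnt/lzcnt: dest is always defined.
struct CountResult { std::uint32_t dest; bool cf; bool zf; };

// bsf: bit index of the lowest set bit; ZF=1 and dest undefined if src == 0.
ScanResult model_bsf(std::uint32_t src) {
    if (src == 0) return {false, 0, true};
    std::uint32_t i = 0;
    while (((src >> i) & 1u) == 0) ++i;
    return {true, i, false};
}

// bsr: bit index of the highest set bit; ZF=1 and dest undefined if src == 0.
ScanResult model_bsr(std::uint32_t src) {
    if (src == 0) return {false, 0, true};
    std::uint32_t i = 31;
    while (((src >> i) & 1u) == 0) --i;
    return {true, i, false};
}

// tzcnt: count of trailing zeros; 32 and CF=1 if src == 0; ZF=1 if result == 0.
CountResult model_tzcnt(std::uint32_t src) {
    std::uint32_t n = 0;
    while (n < 32 && ((src >> n) & 1u) == 0) ++n;
    return {n, src == 0, n == 0};
}

// lzcnt: count of leading zeros; 32 and CF=1 if src == 0; ZF=1 if result == 0.
CountResult model_lzcnt(std::uint32_t src) {
    std::uint32_t n = 0;
    while (n < 32 && ((src >> (31 - n)) & 1u) == 0) ++n;
    return {n, src == 0, n == 0};
}

// Usage: the value from the question gives the same answer either way.
void check_question_value() {
    assert(model_bsf(0xE000u).dest == 13);
    assert(model_tzcnt(0xE000u).dest == 13);
}
```

For the question's input 0xE000, both models return 13, which matches the output seen on Sandy Bridge even though the CPU is really executing bsf.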

At least bsf and tzcnt generate the same result when the source is nonzero; bsr and lzcnt do not even agree on that (for a nonzero 32-bit source, lzcnt returns 31 minus the bsr result).
Also, on processors that implement them natively, lzcnt and tzcnt can execute much faster than bsr/bsf.
It totally sucks that bsf and tzcnt can't agree on flag usage. This needless inconsistency means that I cannot use tzcnt as a drop-in replacement for bsf unless I can be sure its source is nonzero.
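One way to sidestep the zero-source problem is to decide the zero case in software before scanning, so the result does not depend on whether the CPU treats the opcode as bsf or tzcnt. The following is a minimal sketch under that assumption; trailing_zeros32 is a hypothetical helper name, and the compiler-specific branches rely on _BitScanForward (MSVC) and __builtin_ctz (GCC/Clang), both of which are undefined for a zero input.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>  // _BitScanForward
#endif

// Hypothetical helper: trailing-zero count that returns 32 for a zero input
// on every CPU, mirroring what tzcnt would return natively.
std::uint32_t trailing_zeros32(std::uint32_t x) {
    if (x == 0) {
        return 32;  // handled explicitly; the scan below never sees zero
    }
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, x);
    return static_cast<std::uint32_t>(index);
#elif defined(__GNUC__) || defined(__clang__)
    return static_cast<std::uint32_t>(__builtin_ctz(x));
#else
    // Portable fallback: shift until the lowest set bit is found.
    std::uint32_t n = 0;
    while (((x >> n) & 1u) == 0) ++n;
    return n;
#endif
}
```

The extra branch costs a little, but it removes the dependence on flags and on the undefined destination of bsf.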
