The compiler generates an expensive MOVZX instruction

Question

The compiler generates an expensive MOVZX instruction

My profiler identified the following feature profiling as a hot spot.

typedef unsigned short ushort;

bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
    for( unsigned i = 0; i < dim; ++i )
    {
        if( point[i + 1] >= coord[i] ) return false;
    }

    return true;  
}

In particular, the bulk of the runtime is responsible for the assembly MOVZX

( Move with Zero-Extend ). The if statement is compiled to

mov     rcx, QWORD PTR [rdi]
lea     r8d, [rax+1]
add     rsi, 2
movzx   r9d, WORD PTR [rsi-2]
mov     rax, r8
cmp     WORD PTR [rcx+r8*2], r9w
jae     .L5

I would like to persuade the compiler not to generate this instruction, but I guess first I need to understand why this instruction is being generated. Why expansion / null expansion, given that I'm working with the same datatype?

(Find the whole function in explorer explorer .)

+3

c ++ optimization assembly profiling

Olumide Apr 19 17 at 9:32 am

source to share

2 answers

Thanks for the good question!

Clearing registers and dependencies Breaking Idioms

Quoting from Intel® 64 and IA-32 Architecture Reference Optimization Guide , section 3.5.1.8:

Sequences of codes that change partial case may experience some delay in their dependency chain, but can be avoided by using dependency breaking idioms. In processors based on the Intel Core microarchitecture, a series of instructions can help eliminate execution dependencies when software uses the instruction to clear the contents of a register to zero. Separate part-register dependencies between instructions by operating in 32-bit registers instead of partial registers. For moves, this can be done with 32-bit moves or with MOVZX.

Build / Compiler Rule. Rule 37. (M impact, MH generality) . Separate part-register dependencies between instructions by working with 32-bit registers instead of partial registers. For moves, this can be done with 32-bit movements or with MOVZX.

movzx vs mov

The compiler knows movzx is not expensive and thus uses it as often as possible. Movzx may take more bytes to encode than mov, but it shouldn't be done.

Unlike logic, a program with movzx (which fills all the registers) is actually faster than using movs alone, which only sets the lower parts of the registers.

Let me demonstrate this output to you with the following piece of code:

    movzx   ecx, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 2]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]

    skipped 6 more similar triplets that do movzx, shr, xor.

    dec     <<<a counter register >>>>
    jnz     …… <<repeat the whole loop again>>>

Here is the second piece of code, we cleared ecx beforehand and now instead of "movzx ecx, bl" we do "mov cl, bl":

    // ecx is already cleared here to 0

    mov     cl, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    mov     cl, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 2]

    mov     cl, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]

    <<< and so on – as in the example #1>>>

Now guess which of the above two code snippets is faster? Did you think earlier that the speed is the same or the movzx version is slower? Actually the movzx code is faster because all Pentium Pro processors perform Out-of-order command execution and register renaming.

Rename registration

Rename registration is a technique used internally by the processor that eliminates false data dependencies resulting from reuse of registers by using sequential instructions that do not have any real dependencies between them.

Let me just take the first 4 commands from the first piece of code:

movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl

As you can see, instruction 4 depends on instruction 2. Instruction 4 does not depend on the result of instruction 3. Thus, the processor could execute instructions 3 and 4 in parallel (together), but instruction 3 uses a register (read-only) modified by instruction 4 so command 4 can only be run after statement 3 completes. Let's rename the ecx register to edx after the first triplet to avoid this dependency:

    movzx   ecx, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    movzx   edx, bl
    shr     ebx, 8
    xor     eax, dword ptr [edx * 4 + edi + 1024 * 2]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]

Here's what we have now:

movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl

Now instruction 4 does not in any way use the register required for instruction 3, and vice versa, so instructions 3 and 4 can be executed simultaneously!

This is what the CPU does for us. The CPU, when translating instructions into micro-ops, which will be executed by the Out-of-order algorithm, renames the registers internally to eliminate these dependencies, so the micro-ops deal with the renamed internal registers rather than the real ones as we know them. This way we don't need to rename the registers ourselves, as I just renamed in the above example - the processor will automatically rename everything for us, translating instructions into micro-ops.

The micro-ops of instruction 3 and instruction 4 will execute in parallel, since the micro-ops of instruction 4 will deal with a completely different internal register (exposed outward as ecx) than the micro-ops of instruction 3, so we don't need to rename anything.

Let me revert the code back to its original version. Here he is:

movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl

(instructions 3 and 4 are executed in parallel, because ecx of instruction 3 is not ecx by instruction 4, but different, renamed register - the CPU automatically assigned a new fresh register from the pool of internal registers for micro-op instruction 4).

Now back to movxz vs mov.

Movzx clears the register completely, so the CPU is confident that we are not dependent on the previous value that was left in the higher bits of the register. When the CPU sees the movxz instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with the previous instructions. Now take first 4 instructions from our example # 2, where we use mov, not movzx:

mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl

In this case, command 4, by modifying cl, changes bits 0-7 of ecx, leaving bits 8-32 unchanged. Thus, the CPU cannot simply rename the register for instruction 4 and allocate another new register, since instruction 4 depends on bits 8-32 left over from previous instructions. The CPU has to store bits 8-32 before it can execute instruction 4, so it can't just rename the register. It will wait for instruction 3 to complete before executing instruction 4. Instruction 4 is not completely independent - it depends on the previous ECX value andthe previous value of bl. So it depends on two registries at once. If we were to use movzx, it would only depend on one register - bl. As a consequence, instructions 3 and 4 will not run in parallel due to their interdependencies. Sad but true.

This is why it is always faster to work with full registers - and if we only need to modify part of a register - it is always faster to change a full register (for example, using movzx) - so that the CPU knows for sure that the register is no longer dependent on its previous value. Changing the complete registers allows the processor to rename the register and allow the out-of-order execution algorithm to execute this instruction alongside other instructions rather than executing them one at a time.

+3

Maxim masiutin May 11 '17 at 8:57

source to share

fuz · Accepted Answer · 2017-04-19T09:51:37+0000

The instruction note movzx

increases the amount in the larger register. In your case, a word (two bytes) is zero multiplied by dword (four bytes). Usually zero expansion is free, the slower part loads the memory operand WORD PTR [rsi-2]

from RAM.

To speed this up, you can try to make sure the zero value you want to retrieve from RAM is in the L1 cache at the time you need it. You can do this by putting strategic prefetch prefixes in the appropriate place. For example, assuming one cache line is 64 bytes, you can add an internal prefetch structure to fetch an array entry i + 32

every time you loop through.

You can also consider an algorithmic improvement so that less data needs to be retrieved from memory, but this is hardly possible.

The compiler generates an expensive MOVZX instruction

Clearing registers and dependencies Breaking Idioms

movzx vs mov

Rename registration

More articles: