What's going on with gcc's strange stack management in this simple function?

Question

I've seen this r10 oddity a few times, so let's see if anyone knows what's going on.

Let's take this simple function:

#include <stdint.h>

#define SZ 4

void sink(uint64_t *p);

void andpop(const uint64_t* a) {
    uint64_t result[SZ];
    for (unsigned i = 0; i < SZ; i++) {
        result[i] = a[i] + 1;
    }

    sink(result);
}

It just adds 1 to each of the 4 64-bit elements of the passed-in array, stores the results in a local array, and calls sink() on that array (to keep the whole function from being optimized away).

Here's the relevant assembly:

andpop(unsigned long const*):
        lea     r10, [rsp+8]
        and     rsp, -32
        push    QWORD PTR [r10-8]
        push    rbp
        mov     rbp, rsp
        push    r10
        sub     rsp, 40
        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        vpaddq  ymm0, ymm0, YMMWORD PTR [rdi]
        lea     rdi, [rbp-48]
        vmovdqa YMMWORD PTR [rbp-48], ymm0
        vzeroupper
        call    sink(unsigned long*)
        add     rsp, 40
        pop     r10
        pop     rbp
        lea     rsp, [r10-8]
        ret

It is hard to understand almost everything that happens with r10. First, r10 is set to rsp + 8, then comes push QWORD PTR [r10-8], which, as far as I can tell, pushes a copy of the return address onto the stack. After that, rbp is set up as usual, and finally r10 itself is pushed.

To unwind all of this, r10 is popped off the stack and used to restore rsp to its original value.
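The register walk described above can be checked with a small model. This is only a sketch: it tracks register values, not stack memory, and it assumes the System V x86-64 rule that rsp + 8 is 16-byte aligned at function entry. The names frame_trace and trace_andpop are hypothetical helpers, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Values of interest while stepping through the listing above. */
typedef struct {
    uint64_t rsp_at_call;  /* stack pointer at `call sink`          */
    uint64_t buf;          /* address passed to sink, i.e. rbp - 48 */
    uint64_t rsp_at_ret;   /* stack pointer just before `ret`       */
} frame_trace;

/* Models only the stack-pointer arithmetic of andpop's prologue and
 * epilogue. The saved r10 lives in a local variable here instead of
 * simulated stack memory. */
frame_trace trace_andpop(uint64_t rsp_entry) {
    /* Assumption: System V x86-64, so rsp + 8 is 16-byte aligned on entry. */
    assert(rsp_entry % 16 == 8);

    uint64_t rsp = rsp_entry;
    uint64_t r10 = rsp + 8;          /* lea  r10, [rsp+8]       */
    rsp &= ~(uint64_t)31;            /* and  rsp, -32           */
    rsp -= 8;                        /* push QWORD PTR [r10-8]  */
    rsp -= 8;                        /* push rbp                */
    uint64_t rbp = rsp;              /* mov  rbp, rsp           */
    rsp -= 8;                        /* push r10                */
    uint64_t saved_r10 = r10;
    rsp -= 40;                       /* sub  rsp, 40            */

    frame_trace t;
    t.buf = rbp - 48;                /* lea  rdi, [rbp-48]      */
    t.rsp_at_call = rsp;             /* call sink               */

    rsp += 40;                       /* add  rsp, 40            */
    r10 = saved_r10;                 /* pop  r10                */
    rsp += 8;
    rsp += 8;                        /* pop  rbp                */
    rsp = r10 - 8;                   /* lea  rsp, [r10-8]       */
    t.rsp_at_ret = rsp;
    return t;
}
```

Calling trace_andpop with an entry value that is 8 modulo 32 and with one that is 24 modulo 32 gives a 32-byte-aligned buf in both cases, and rsp_at_ret equals the entry value either way.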

Some observations:

Looking at the whole function, all of this seems like a completely roundabout way of restoring rsp to its original value before ret, but the usual epilogue of mov rsp, rbp would do just as well (see clang).

That said, the (expensive) push QWORD PTR [r10-8] doesn't even help with that mission: this value (the return address?) is apparently never used.

Why is r10 pushed and popped at all? The value isn't clobbered in the very small function body, and there is no register pressure.

What's up with that? I've seen this a few times before; it usually wants to use r10, and sometimes r13. It looks like it has something to do with aligning the stack to 32 bytes, since if you change SZ to less than 4 it uses xmm ops and the problem goes away.

Here's SZ == 2, for example:

andpop(unsigned long const*):
        sub     rsp, 24
        vmovdqa xmm0, XMMWORD PTR .LC0[rip]
        vpaddq  xmm0, xmm0, XMMWORD PTR [rdi]
        mov     rdi, rsp
        vmovaps XMMWORD PTR [rsp], xmm0
        call    sink(unsigned long*)
        add     rsp, 24
        ret

Much nicer!
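One way to probe the alignment hypothesis is to keep SZ at 2 but request 32-byte alignment for the local explicitly. The sketch below assumes the compiler honors alignas(32) on automatic variables (GCC and Clang on x86-64 do); local_is_32_aligned is a hypothetical name. Whether the r10 sequence reappears in the generated assembly depends on compiler version and flags and is not verified here; the function only reports the alignment it actually got at run time.

```c
#include <stdalign.h>
#include <stdint.h>

#define SZ 2

/* Returns 1 if the over-aligned local landed on a 32-byte boundary
 * and the first element was computed as expected. */
int local_is_32_aligned(const uint64_t *a) {
    alignas(32) uint64_t result[SZ];   /* explicit 32-byte request */
    for (unsigned i = 0; i < SZ; i++) {
        result[i] = a[i] + 1;
    }
    return ((uintptr_t)result % 32) == 0 && result[0] == a[0] + 1;
}
```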

compiler-optimization gcc x86

BeeOnRope 31 jul. 17 at 18:55

1 answer

Florian Weimer · Answer 1 · 2017-07-31T19:08:07+0000

Well, you answered your own question: the stack pointer must be aligned to 32 bytes before the stack can be accessed with aligned AVX2 loads and stores, but the ABI only guarantees 16-byte alignment. Since the compiler cannot know by how much the alignment is off, it has to save the stack pointer in a scratch register and restore it afterwards. But the saved value has to survive the function call, so it must be put on the stack, and a stack frame has to be created.
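To make the "cannot know by how much" point concrete, here is a minimal sketch of the arithmetic. It assumes the System V x86-64 convention that rsp + 8 is 16-byte aligned at function entry, so the incoming rsp is either 8 or 24 modulo 32. realign_drop is a hypothetical helper name.

```c
#include <stdint.h>

/* Number of bytes that `and rsp, -32` discards for a given
 * incoming stack pointer value. */
uint64_t realign_drop(uint64_t rsp_entry) {
    uint64_t aligned = rsp_entry & ~(uint64_t)31;  /* and rsp, -32 */
    return rsp_entry - aligned;
}
```

For an entry value ending in 0x08 the drop is 8 bytes, and for one ending in 0x18 it is 24 bytes. Because the amount depends on the caller's stack depth, a fixed add in the epilogue cannot undo it, and the original rsp has to be kept somewhere.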

Some x86-64 ABIs have a red zone (the area of the stack below the stack pointer that is not used by signal handlers), so it is quite possible not to change the stack pointer at all for such short functions, but GCC apparently does not implement this optimization, and it will not apply here anyway because of the function call at the end.

Also, the default implementation of stack alignment is rather lackluster. In this case, -maccumulate-outgoing-args results in better-looking code with GCC 6:

andpop:
        pushq   %rbp
        movq    %rsp, %rbp
        andq    $-32, %rsp
        subq    $32, %rsp
        vmovdqu (%rdi), %xmm0
        vinserti128     $0x1, 16(%rdi), %ymm0, %ymm0
        movq    %rsp, %rdi
        vpaddq  .LC0(%rip), %ymm0, %ymm0
        vmovdqa %ymm0, (%rsp)
        vzeroupper
        call    sink@PLT
        leave
        ret

This issue (GCC generating poor code for stack alignment) came to a head recently when we had to implement a workaround for the GCC __tls_get_addr ABI bug and ended up writing the stack realignment by hand.

EDIT There is also another issue, related to RTL pass ordering: the stack alignment is chosen before the final determination of whether the stack is actually needed, as BeeOnRope's second example shows.
