What's going on with gcc's strange stack management in this simple function?
I've seen this r10 oddity a few times, so let's see if anyone knows what's going on.
Let's take this simple function:
#include <stdint.h>

#define SZ 4

void sink(uint64_t *p);

void andpop(const uint64_t* a) {
    uint64_t result[SZ];
    for (unsigned i = 0; i < SZ; i++) {
        result[i] = a[i] + 1;
    }
    sink(result);  /* keeps the loop from being optimized away */
}
It just adds 1 to each of the 4 64-bit elements of the passed array, stores the results in a local array, and calls sink() on that array (so the whole function isn't optimized away).
Here's the relevant assembly:
andpop(unsigned long const*):
lea r10, [rsp+8]
and rsp, -32
push QWORD PTR [r10-8]
push rbp
mov rbp, rsp
push r10
sub rsp, 40
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
vpaddq ymm0, ymm0, YMMWORD PTR [rdi]
lea rdi, [rbp-48]
vmovdqa YMMWORD PTR [rbp-48], ymm0
vzeroupper
call sink(unsigned long*)
add rsp, 40
pop r10
pop rbp
lea rsp, [r10-8]
ret
It's hard to understand almost everything that happens with r10. First, r10 is set to rsp + 8. Then comes push QWORD PTR [r10-8], which, as far as I can tell, pushes a copy of the return address onto the stack. After that, rbp is set up as usual, and finally r10 itself is pushed.
To unwind all this, r10 is popped off the stack and used to restore rsp to its original value.
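The prologue and epilogue can be traced with a toy model. The following is only a sketch: the Machine struct, the push/pop helpers, run_andpop and the return address 0x401000 are made up for illustration, and the body of sink() is modeled as nothing more than a clobber of r10. It replays the stack-related instructions from the listing above and records where the buffer lands and what rsp and rbp hold just before ret.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 40, ENTRY_INDEX = 32 };

/* Hypothetical toy machine: three registers and a small simulated stack. */
typedef struct {
    uint64_t rsp, rbp, r10;
    uint64_t base;              /* lowest simulated address */
    uint64_t mem[STACK_WORDS];  /* mem[i] models the qword at base + 8*i */
} Machine;

static uint64_t *slot(Machine *m, uint64_t addr) {
    return &m->mem[(addr - m->base) / 8];
}

static void push(Machine *m, uint64_t v) {
    m->rsp -= 8;
    *slot(m, m->rsp) = v;
}

static uint64_t pop(Machine *m) {
    uint64_t v = *slot(m, m->rsp);
    m->rsp += 8;
    return v;
}

typedef struct { uint64_t buffer, final_rsp, final_rbp; } Trace;

/* entry_rsp must be a multiple of 8; [entry_rsp] holds the return address. */
static Trace run_andpop(uint64_t entry_rsp, uint64_t caller_rbp) {
    Machine m = {0};
    Trace t;

    m.rsp = entry_rsp;
    m.rbp = caller_rbp;
    m.base = entry_rsp - 8 * ENTRY_INDEX;
    *slot(&m, entry_rsp) = 0x401000;   /* made-up return address */

    m.r10 = m.rsp + 8;                 /* lea  r10, [rsp+8]       */
    m.rsp &= ~(uint64_t)31;            /* and  rsp, -32           */
    push(&m, *slot(&m, m.r10 - 8));    /* push QWORD PTR [r10-8]  */
    push(&m, m.rbp);                   /* push rbp                */
    m.rbp = m.rsp;                     /* mov  rbp, rsp           */
    push(&m, m.r10);                   /* push r10                */
    m.rsp -= 40;                       /* sub  rsp, 40            */

    t.buffer = m.rbp - 48;             /* lea  rdi, [rbp-48]      */
    m.r10 = 0;                         /* sink() may clobber r10  */

    m.rsp += 40;                       /* add  rsp, 40            */
    m.r10 = pop(&m);                   /* pop  r10                */
    m.rbp = pop(&m);                   /* pop  rbp                */
    m.rsp = m.r10 - 8;                 /* lea  rsp, [r10-8]       */

    t.final_rsp = m.rsp;
    t.final_rbp = m.rbp;
    return t;
}
```

In this model the buffer address is a multiple of 32 whether the entry rsp is 8 or 24 modulo 32, and rsp and rbp return to their entry values in both cases. The model also clobbers r10 across the call, which is permitted on System V x86-64 because r10 is a caller-saved register.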
Some observations:
- Looking at the whole function, all of this seems like a completely roundabout way of simply restoring rsp to its original value before ret, but the usual epilogue mov rsp, rbp would do just as well (see clang).
- That said, the (expensive) push QWORD PTR [r10-8] doesn't even help with that mission: this value (the return address?) is apparently never used.
- Why is r10 pushed and popped at all? The value isn't clobbered in the very small function body, and there is no register pressure.
What's up with that? I've seen this a few times before; it usually wants to use r10, sometimes r13. It seems to have something to do with aligning the stack to 32 bytes, since if you change SZ to less than 4 it uses xmm ops and the problem goes away.
Here's SZ == 2, for example:
andpop(unsigned long const*):
sub rsp, 24
vmovdqa xmm0, XMMWORD PTR .LC0[rip]
vpaddq xmm0, xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
vmovaps XMMWORD PTR [rsp], xmm0
call sink(unsigned long*)
add rsp, 24
ret
Much nicer!
Well, you answered your own question: the stack pointer must be 32-byte aligned before the stack can be accessed with aligned AVX2 loads and stores, but the ABI only guarantees 16-byte alignment. Since the compiler cannot know how far off the alignment is, it has to save the original stack pointer in a scratch register and restore it afterwards. But the saved value must survive the function call, so it has to be pushed onto the stack, and a stack frame has to be created.
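The "cannot know how far off" part can be illustrated with a little address arithmetic. This is only a sketch: align_down and realign_slack are hypothetical helper names, and the addresses in the note below are invented. It assumes the System V x86-64 convention that rsp + 8 is a multiple of 16 at function entry, so rsp modulo 32 is either 8 or 24.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round addr down to a power-of-two boundary.
   For alignment == 32 this is the same arithmetic as `and rsp, -32`. */
static uint64_t align_down(uint64_t addr, uint64_t alignment) {
    return addr & ~(alignment - 1);
}

/* Number of bytes skipped when rsp is realigned to 32 bytes at entry. */
static uint64_t realign_slack(uint64_t entry_rsp) {
    return entry_rsp - align_down(entry_rsp, 32);
}
```

For an entry rsp of 0x7ffc00001008 the slack is 8 bytes, and for 0x7ffc00001018 it is 24 bytes, although both round down to 0x7ffc00001000. Because the slack depends on the run-time value of rsp, a constant add rsp, N in the epilogue cannot undo the and; the original value has to be kept somewhere (r10 and a stack slot in the first listing, rbp in clang's version).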
Some x86-64 ABIs have a red zone (a region of the stack below the stack pointer that is not used by signal handlers), so it is feasible not to change the stack pointer at all for such short functions. GCC apparently does not implement this optimization, and it would not apply here anyway because of the function call at the end.
Also, the default implementation of stack alignment is rather poor. In this case, -maccumulate-outgoing-args results in nicer code with GCC 6:
andpop:
pushq %rbp
movq %rsp, %rbp
andq $-32, %rsp
subq $32, %rsp
vmovdqu (%rdi), %xmm0
vinserti128 $0x1, 16(%rdi), %ymm0, %ymm0
movq %rsp, %rdi
vpaddq .LC0(%rip), %ymm0, %ymm0
vmovdqa %ymm0, (%rsp)
vzeroupper
call sink@PLT
leave
ret
This issue (GCC generating poor code for stack alignment) came up recently when we had to implement a workaround for the GCC __tls_get_addr ABI bug and ended up writing the stack realignment by hand.
EDIT: There is another issue, related to the ordering of RTL passes: stack alignment is chosen before the final determination of whether the stack is actually needed, as BeeOnRope's second example shows.