Difference between fetching memory with and without offset on Intel
Appel explains in "Runtime Tags Aren't Necessary" on page 8 how to distinguish integers from pointers using tagged pointers:
Some implementations use a low-order tag of 0 for integers, and then integer addition can be done with the normal machine add instruction and no shifting or correction is necessary (since 2x + 2y = 2(x + y)). This requires pointers to be tagged with 1; but pointer fetches can be done at odd offsets to compensate.
The idea is this: if a pointer is aligned, the value will be a multiple of 2 or 4. In this case, the bottom 1 or 2 bits are always zero and can be set to some value to implement tags to distinguish integers from pointers.
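To make the scheme concrete, here is a minimal C sketch of low-bit tagging as I understand it from the quote. The names `value`, `tag_int`, `untag_int`, `tag_ptr`, `is_ptr` and `fetch` are made up for illustration and are not from Appel's paper; the sketch assumes pointers are at least 2-byte aligned.

```c
#include <stdint.h>

/* Hypothetical name for a tagged machine word. */
typedef uintptr_t value;

/* Integers carry tag 0 in the low bit: x is stored as 2x. */
static value tag_int(intptr_t x) { return (value)x << 1; }

/* A tagged integer is always even, so dividing by 2 is exact. */
static intptr_t untag_int(value v) { return (intptr_t)v / 2; }

/* Pointers carry tag 1. Assumes p is at least 2-byte aligned,
   so its low bit is free. */
static value tag_ptr(const void *p) { return (value)(uintptr_t)p + 1; }

static int is_ptr(value v) { return (int)(v & 1); }

/* Fetch through a tagged pointer: the -1 cancels the tag. This is the
   C counterpart of a load at an odd offset such as [ebx-0x1]. */
static int32_t fetch(value v) { return *(const int32_t *)(v - 1); }
```

Adding two tagged integers with a plain `+` gives the tagged sum directly, because 2x + 2y = 2(x + y), and `fetch` folds the tag removal into the address computation instead of spending a separate instruction on it.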
Raw pointer fetch without offset in Intel syntax:
mov eax, DWORD PTR [ebx]
And the equivalent fetch through a tagged pointer, with the offset, looks like this:
mov eax, DWORD PTR [ebx-0x1]
What is the difference in cycles between the two fetches?
The complexity of the addressing mode usually does not affect the throughput of load instructions, but it can add 1 cycle of latency [1].
In particular, the simple addressing modes, which are [base] or [base + offset] where offset < 2048, usually take 4 cycles, and the complex modes (anything that is not simple) take 5 cycles. This is for general-purpose register loads: for vector loads, you usually add 1 or 2 cycles.
So, in your case, you are only using base with a very small offset, so you should get the fastest load latency of 4 cycles.
This is for Intel, I'm not sure about AMD.
See the Intel Optimization Guide for details, but here's the source I could find most quickly.
As Ross points out in the comments, there is at least one other downside to using an offset: the instruction is one byte longer for the offset version (and would be 4 bytes longer if your offset fell outside the -128 to 127 range), which slightly increases pressure on the instruction cache.
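As a rough illustration of the size difference, here is a small C sketch that computes how many displacement bytes the encoding needs. The function names are hypothetical, and the sketch assumes 32-bit mode with a base register other than ebp (which always needs a displacement) or esp (which needs an extra SIB byte).

```c
#include <stdint.h>

/* Displacement bytes needed to encode [base + disp].
   Assumption: base is not ebp or esp, and there is no index register. */
static int disp_bytes(int32_t disp) {
    if (disp == 0) return 0;                    /* plain [base]         */
    if (disp >= -128 && disp <= 127) return 1;  /* sign-extended disp8  */
    return 4;                                   /* full disp32          */
}

/* Length of `mov r32, [base + disp]` in 32-bit mode:
   1 opcode byte + 1 ModRM byte + displacement. */
static int mov_load_length(int32_t disp) {
    return 2 + disp_bytes(disp);
}
```

Under these assumptions, `mov eax, DWORD PTR [ebx]` encodes as 2 bytes (8B 03) and `mov eax, DWORD PTR [ebx-0x1]` as 3 bytes (8B 43 FF).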
[1] It goes without saying that this is for hits in L1. If you miss in L1, the latency will be longer, possibly much longer, and it probably matters less whether you still pay the extra cycle in that case (but I guess you do, on average, since the miss cannot start until the address has been calculated).