Difference between fetching memory with and without offset on Intel

Appel explains in "Runtime Tags Aren't Necessary" on page 8 how to distinguish integers from pointers using tagged pointers:

Some implementations use a low-order tag of 0 for integers, and then integer addition can be done using the ordinary machine add instruction and no shifting or correction is necessary (since 2x + 2y = 2(x + y)). This requires pointers to be tagged with 1; but fetches from pointers can be done with odd offsets to compensate.

The idea is this: if a pointer is aligned, the value will be a multiple of 2 or 4. In this case, the bottom 1 or 2 bits are always zero and can be set to some value to implement tags to distinguish integers from pointers.
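The tagging scheme described above can be sketched in C. This is a minimal illustration, not Appel's implementation: the type `value_t` and the helper names are hypothetical, and it assumes pointers are at least 2-byte aligned and that the platform uses two's complement integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tagged machine word: low bit 0 = integer, low bit 1 = pointer. */
typedef uintptr_t value_t;

/* An integer n is stored as 2n, so its low-order bit is 0. */
static value_t tag_int(intptr_t n) { return (value_t)n << 1; }

/* Tagged integers are always even, so dividing by 2 is exact. */
static intptr_t untag_int(value_t v) { return (intptr_t)v / 2; }

static int is_int(value_t v) { return (v & 1) == 0; }

/* 2x + 2y = 2(x + y): a plain machine add works on tagged integers. */
static value_t add_ints(value_t a, value_t b) { return a + b; }

/* A pointer is stored as address + 1; this assumes the address is even. */
static value_t tag_ptr(const int32_t *p) {
    assert(((uintptr_t)p & 1) == 0);
    return (uintptr_t)p + 1;
}

/* Fetch through a tagged pointer. The -1 cancels the tag, and a compiler
   will typically fold it into the displacement of the load instruction. */
static int32_t fetch(value_t v) {
    return *(const int32_t *)(v - 1);
}
```

With these helpers, `fetch(tag_ptr(&cell))` reads `cell` through an odd address value, which corresponds to the `[ebx-0x1]` form of the load in the question.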

Raw pointer fetch without offset in Intel syntax:

mov    eax, DWORD PTR [ebx]

      

And the equivalent fetch through a tagged pointer, using an offset, looks like this:

mov    eax, DWORD PTR [ebx-0x1] 

      

What is the difference in cycles between the two fetches?



1 answer


The complexity of the addressing mode usually does not affect the throughput of load instructions, but it can add 1 cycle of latency [1].

In particular, a simple addressing mode, meaning [base] or [base + offset] where offset < 2048, usually has a latency of 4 cycles, and complex modes (anything that isn't simple) take 5 cycles. This is for loads into general purpose registers; for vector loads, you usually add 1 or 2 cycles.

So in your case, you are using only a base with a very small offset, so you should get the fastest load latency of 4 cycles.
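One way to check this on a given machine is a pointer-chasing loop, where each load's address depends on the previous load's result, so the time per step approximates the load latency. The sketch below only builds the two chains and walks them; the function names are hypothetical, the timing harness is left out, and the instructions named in the comments are what a compiler is expected to emit on x86-64, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Link n slots into a ring. bias is 0 for raw pointers and 1 for
   pointers tagged with a low-order 1. */
static void build_ring(uintptr_t *slots, size_t n, uintptr_t bias) {
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        slots[i] = (uintptr_t)&slots[(i + 1) % n] + bias;
}

/* Dependent loads with no displacement, e.g. mov rax, [rax]. */
static uintptr_t chase_plain(uintptr_t p, size_t steps) {
    while (steps--)
        p = *(const uintptr_t *)p;
    return p;
}

/* Dependent loads with a -1 displacement, e.g. mov rax, [rax-1]. */
static uintptr_t chase_tagged(uintptr_t p, size_t steps) {
    while (steps--)
        p = *(const uintptr_t *)(p - 1);
    return p;
}
```

Timing both functions over many steps on a small ring that stays in L1 should show whether the displacement costs anything; if the answer above holds, the two should take about the same time per step.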

This is for Intel, I'm not sure about AMD.



See the Intel Optimization Guide for details, but here's the source I could find most quickly.

As Ross points out in the comments, there is at least one other downside to using an offset: the instruction is one byte longer with the offset (and 4 bytes longer if the offset is outside the range -128 to 127), which slightly increases pressure on the icache.
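The size difference can be seen in the 32-bit machine code for the load. The sketch below lists the bytes for `mov eax, [ebx + disp]` as opcode 8B, a ModRM byte, and an optional displacement; the array and function names are hypothetical, and it assumes the assembler picks the shortest encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* mov eax, DWORD PTR [ebx]        : opcode, ModRM (no displacement) */
static const uint8_t mov_no_disp[] = { 0x8B, 0x03 };

/* mov eax, DWORD PTR [ebx-0x1]    : opcode, ModRM, 8-bit displacement */
static const uint8_t mov_disp8[] = { 0x8B, 0x43, 0xFF };

/* mov eax, DWORD PTR [ebx-0x81]   : opcode, ModRM, 32-bit displacement
   (-129 stored little-endian) */
static const uint8_t mov_disp32[] = { 0x8B, 0x83, 0x7F, 0xFF, 0xFF, 0xFF };

/* Length in bytes of mov eax, [ebx + disp] under the shortest encoding. */
static size_t mov_eax_ebx_len(int32_t disp) {
    if (disp == 0)
        return 2;                      /* no displacement bytes */
    if (disp >= -128 && disp <= 127)
        return 3;                      /* one displacement byte */
    return 6;                          /* four displacement bytes */
}
```

So the tagged fetch at offset -1 is 3 bytes instead of 2, and an offset outside -128 to 127 would make it 6 bytes.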


[1] It goes without saying that this is for hits in L1. If you miss in L1, the latency will be longer, possibly much longer, and it probably doesn't matter whether you still pay the extra cycle in that case (but I guess you do, on average, since the miss handling can't start until the address is calculated).
