High run-to-run variance in a copy loop implemented with MOVDQU

I'm looking for an explanation for the results I'm seeing from a loop that moves 64 bytes per iteration, from some source memory location to some destination memory location, using the x86 movdqu instruction (movdqu supports moving 16 bytes of data between xmm registers and potentially unaligned memory locations). This is a piece of code that implements a function similar to memcpy() / java.lang.System.arraycopy().

There are two different patterns I experimented with:

Pattern1

0x30013f74: prefetchnta BYTE PTR [rsi]
0x30013f77: prefetchnta BYTE PTR [rdi]
0x30013f7a: movdqu xmm3, XMMWORD PTR [rsi+0x30]
0x30013f7f: movdqu xmm2, XMMWORD PTR [rsi+0x20]
0x30013f84: movdqu XMMWORD PTR [rdi+0x30],xmm3
0x30013f89: movdqu XMMWORD PTR [rdi+0x20],xmm2
0x30013f8e: movdqu xmm1, XMMWORD PTR [rsi+0x10]
0x30013f93: movdqu xmm0, XMMWORD PTR [rsi]
0x30013f97: movdqu XMMWORD PTR [rdi+0x10], xmm1
0x30013f9c: movdqu XMMWORD PTR [rdi], xmm0


In this pattern, rsi holds the source address (src), rdi holds the destination address (dst), and the xmm registers are used as temporary registers. This code is repeated copylen_in_bytes / 64 times. As you can see, the loads and stores follow an ld-ld-st-st-ld-ld-st-st pattern here.
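For reference, Pattern1's 64-byte body corresponds roughly to the following SSE2 intrinsics (an illustration only: the function name is mine, and a compiler is free to reschedule intrinsics, so reproducing the exact instruction order may require hand-written assembly):

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */
#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>

/* Pattern1: prefetch, then ld-ld-st-st-ld-ld-st-st per 64 bytes. */
static void copy_pattern1(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    for (size_t i = 0; i + 64 <= len; i += 64) {
        _mm_prefetch((const char *)(src + i), _MM_HINT_NTA);
        _mm_prefetch((const char *)(dst + i), _MM_HINT_NTA);
        __m128i x3 = _mm_loadu_si128((const __m128i *)(src + i + 0x30));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(src + i + 0x20));
        _mm_storeu_si128((__m128i *)(dst + i + 0x30), x3);
        _mm_storeu_si128((__m128i *)(dst + i + 0x20), x2);
        __m128i x1 = _mm_loadu_si128((const __m128i *)(src + i + 0x10));
        __m128i x0 = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i + 0x10), x1);
        _mm_storeu_si128((__m128i *)(dst + i), x0);
    }
}
```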

Pattern2

0x30013f74: prefetchnta BYTE PTR [rsi]
0x30013f77: prefetchnta BYTE PTR [rdi]
0x30013f7a: movdqu xmm3, XMMWORD PTR [rsi+0x30]
0x30013f7f: movdqu XMMWORD PTR [rdi+0x30], xmm3
0x30013f84: movdqu xmm2, XMMWORD PTR [rsi+0x20]
0x30013f89: movdqu XMMWORD PTR [rdi+0x20], xmm2
0x30013f8e: movdqu xmm1, XMMWORD PTR [rsi+0x10]
0x30013f93: movdqu XMMWORD PTR [rdi+0x10], xmm1
0x30013f98: movdqu xmm0, XMMWORD PTR [rsi]
0x30013f9c: movdqu XMMWORD PTR [rdi], xmm0


Pattern2 follows the ld-st-ld-st-ld-st-ld-st pattern.
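Correspondingly, Pattern2's body pairs each 16-byte load immediately with its store (same caveats as with any intrinsics sketch: names are mine, and the compiler may reorder):

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */
#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>

/* Pattern2: prefetch, then ld-st-ld-st-ld-st-ld-st per 64 bytes. */
static void copy_pattern2(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    for (size_t i = 0; i + 64 <= len; i += 64) {
        _mm_prefetch((const char *)(src + i), _MM_HINT_NTA);
        _mm_prefetch((const char *)(dst + i), _MM_HINT_NTA);
        __m128i x;
        x = _mm_loadu_si128((const __m128i *)(src + i + 0x30));
        _mm_storeu_si128((__m128i *)(dst + i + 0x30), x);
        x = _mm_loadu_si128((const __m128i *)(src + i + 0x20));
        _mm_storeu_si128((__m128i *)(dst + i + 0x20), x);
        x = _mm_loadu_si128((const __m128i *)(src + i + 0x10));
        _mm_storeu_si128((__m128i *)(dst + i + 0x10), x);
        x = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), x);
    }
}
```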

Observations

When running this code a few hundred times with src and dst aligned on different 8-byte boundaries, I observe the following:

On Westmere (Xeon X5690)

Pattern1 exhibits very high run-to-run variance.

Pattern2 exhibits very little run-to-run variance.

The minimum (fastest observed) time on Pattern2 is ~8% higher than the minimum time on Pattern1.

On Ivybridge (Xeon E5-2697 v2)

Pattern1 exhibits very high run-to-run variance.

Pattern2 exhibits very little run-to-run variance.

The minimum time on Pattern2 is ~20% higher than the minimum time on Pattern1.

On Haswell (Core i7-4770)

Pattern1 does NOT exhibit high run-to-run variance.

Pattern2 exhibits very little run-to-run variance, as before.

The minimum time on Pattern2 is ~20% higher than the minimum time on Pattern1.

Interestingly, on Westmere and Ivybridge, there seems to be no correlation between src/dst alignment and the bad results (the ones responsible for the high variance). I see both good and bad numbers for the same src/dst alignment.

Questions

I understand that a movdqu that spans a cache-line boundary will perform worse than one that doesn't, but I don't understand the following:

1) Why does Pattern1 exhibit such high variance on Westmere and Ivybridge? How does the ordering of the loads and stores make a difference?

2) Why are the minimum times on Pattern2 slower than on Pattern1, across all these architectures?
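(On the cache-line-spanning point: with 64-byte lines, one way to check whether a given offset produces spanning accesses is the small helper below; it is mine, not from the code above.)

```c
#include <stdint.h>

/* With 64-byte cache lines, a 16-byte access at address a touches
 * bytes a .. a+15, so it crosses a line boundary exactly when
 * (a % 64) + 16 > 64, i.e. (a % 64) > 48. */
static int spans_cache_line(uintptr_t a)
{
    return (a & 63u) > 48u;
}
```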

Thanks for taking the time to read this long post.

Kartik
