What is the floating-point version (__m256d) of the non-temporal streaming load (_mm256_stream_load_si256)?

In AVX/AVX2, I could only find _mm256_stream_load_si256(), which is for __m256i. Is there no way to stream-load a __m256d, and if not, why? (I would like to load it without polluting the CPU cache.)

Is there any problem with doing the following (aggressive casting)?

__m256d *pDest = /* ... */;
__m256d *pSrc = /* ... */;

/* ... */

const __m256i iWeight = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(pSrc));
const __m256d prior = _mm256_div_pd(*reinterpret_cast<const __m256d*>(&iWeight), divisor);
_mm256_stream_pd(reinterpret_cast<double*>(pDest), prior);

      



1 answer


The intrinsic _mm256_stream_load_si256() corresponds to the (V)MOVNTDQA instruction. This is the only non-temporal load instruction, so it is the one you should use even when you are loading floating-point data.

(The other three non-temporal instructions are stores only: (V)MOVNTDQ (_mm256_stream_si256) for double quadwords, (V)MOVNTPS (_mm256_stream_ps) for single-precision floating-point values, and (V)MOVNTPD (_mm256_stream_pd) for double-precision floating-point values.)

Casting from __m256i* to __m256d*, and vice versa, is safe. They are just bits, and they are all stored in a YMM register. I have never seen a compiler that has problems with these kinds of casts. You should probably check the resulting assembly code to make sure it isn't doing something funky, though!
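As a minimal sketch of the load-then-reinterpret pattern described above: the helper name stream_divide and its parameters are hypothetical, not from the question. It assumes GCC or Clang (for the target attribute), an AVX2-capable CPU, 32-byte-aligned pointers, and a length that is a multiple of 4. It uses the cast intrinsic _mm256_castsi256_pd instead of a pointer cast on the register value; the cast intrinsic only changes the type and emits no instruction.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper (illustration only): divides n doubles from src by
// `divisor` and writes them to dst, using non-temporal loads and stores.
// Assumptions: src and dst are 32-byte aligned, n is a multiple of 4.
__attribute__((target("avx2")))
void stream_divide(double* dst, const double* src, std::size_t n, double divisor)
{
    const __m256d div = _mm256_set1_pd(divisor);
    for (std::size_t i = 0; i < n; i += 4) {
        // (V)MOVNTDQA: the only non-temporal load; it is typed as integer.
        const __m256i raw = _mm256_stream_load_si256(
            reinterpret_cast<const __m256i*>(src + i));
        // Reinterpret the same 256 bits as four doubles (no instruction emitted).
        const __m256d vals = _mm256_castsi256_pd(raw);
        // (V)MOVNTPD: non-temporal store of the quotient.
        _mm256_stream_pd(dst + i, _mm256_div_pd(vals, div));
    }
    _mm_sfence(); // order the streaming stores before any later stores
}
```

Inspecting the generated assembly, as suggested above, should show a vmovntdqa feeding directly into vdivpd with no extra move in between.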



The only time this matters is on some processors where there is a domain-crossing penalty when you mix floating-point SIMD instructions with integer SIMD instructions. But since the only NT load is in the integer domain, you really have no choice.

Note that all non-temporal instructions (loads and stores) require aligned addresses!
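For the 256-bit forms, "aligned" means a 32-byte boundary. A small check like the following can be used in debug assertions before streaming; the helper name is_aligned_32 is hypothetical and only illustrates the requirement.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only): returns true if p sits on a
// 32-byte boundary, as the 256-bit non-temporal loads and stores require.
inline bool is_aligned_32(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

Buffers declared with alignas(32), or obtained from an aligned allocator, would pass this check; a pointer offset by a single double from such a buffer would not.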
