Why is there no floating point intrinsic for the PSHUFD instruction?

Question


The challenge I'm facing is to shuffle one __m128 vector and store the result in another.

As I see it, there are two main ways to shuffle a packed floating point __m128 vector:

_mm_shuffle_ps, which uses the SHUFPS instruction. It is not necessarily the best option if you only want values from a single vector: it takes two of its values from the destination operand, which implies an extra move.
_mm_shuffle_epi32, which uses the PSHUFD instruction. It seems to do exactly what is expected here and may have better latency / throughput than SHUFPS.
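For illustration, here is a minimal sketch of the first option used as a one-input shuffle, by passing the same vector as both operands. The helper name reverse_lanes is hypothetical and not part of any library; only _mm_shuffle_ps and _MM_SHUFFLE come from the SSE headers.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE

// Hypothetical helper (not from the question): reverse the four lanes of v.
// Passing v as both operands makes the two-input shuffle act on one vector.
inline __m128 reverse_lanes(__m128 v) {
    // _MM_SHUFFLE(z, y, x, w) gives the source lane for result lanes 3, 2, 1, 0.
    // Here result lane 0 takes source lane 3, lane 1 takes lane 2, and so on.
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```

Which instruction a compiler actually emits for this call is not something the sketch can show; that is exactly what the question asks about.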

The latter, however, works with integer vectors (__m128i), and there seems to be no floating point counterpart, so using it with __m128 would require some ugly explicit casting. Also, the fact that there is no such counterpart probably means there is a good reason for it that I am not aware of.

The question is, why is there no intrinsic to shuffle one floating point vector and store the result in another?
If _mm_shuffle_ps(x, x, ...) can generate PSHUFD, can that be guaranteed?
If PSHUFD shouldn't be used for floating point values, what is the reason for this?

Thanks!


c++ assembly vectorization sse intrinsics

Ap31 Apr 19 17 at 12:10


1 answer

icecreamsword · Accepted Answer · 2017-04-19T17:27:12+0000

Intrinsics are supposed to map one-to-one to instructions. It would be highly undesirable for _mm_shuffle_ps to generate PSHUFD. It should always generate SHUFPS. The documentation does not suggest that there is any case where it would do otherwise.

Some processors incur a performance penalty when data is cast to single or double precision floating point. This is because the processor augments the SSE registers with internal registers that hold the FP classification of the data, e.g. zero, NaN, infinity or normal. When you switch types, you incur a stall while it performs that step. I don't know if this is still true for modern processors, but you can check the Intel Architecture Optimization Guides for this information.

SHUFPS is not significantly slower than PSHUFD on modern processors. According to Agner Fog's instruction tables ( http://www.agner.org/optimize/instruction_tables.pdf ), they have identical latency and throughput on Haswell (4th gen Core i7). On Nehalem (1st gen Core i7) they have identical latency, but PSHUFD has a throughput of 2 per cycle and SHUFPS has a throughput of 1 per cycle. Thus, you cannot say that one instruction should be preferred over the other on all processors, even if you ignore the performance penalty associated with switching types.

There is also a way to cast between __m128, __m128d and __m128i: _mm_castXX_YY ( https://software.intel.com/en-us/node/695375?language=es ), where XX and YY are each one of ps, pd, or si128. For example, _mm_castps_pd(). Using it here is really a bad idea, because the processors on which PSHUFD is faster also suffer the performance penalty associated with switching back to FP afterwards. In other words, there is no faster way to do SHUFPS than doing SHUFPS.
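For reference, the cast-based route described above could be sketched as follows. The helper name shuffle_via_pshufd is hypothetical; the intrinsics it calls (_mm_castps_si128, _mm_shuffle_epi32, _mm_castsi128_ps) are the standard SSE2 ones. The casts only reinterpret the register type and generate no instructions themselves. This shows what the casting looks like, and is not a recommendation, given the type-switching penalty discussed in this answer.

```cpp
#include <emmintrin.h>  // SSE2: _mm_shuffle_epi32, _mm_castps_si128, _mm_castsi128_ps

// Hypothetical helper: shuffle a float vector through the integer shuffle.
// Imm is the usual 8-bit selector, e.g. built with _MM_SHUFFLE.
template <int Imm>
inline __m128 shuffle_via_pshufd(__m128 v) {
    __m128i bits = _mm_castps_si128(v);               // reinterpret floats as integers
    __m128i shuffled = _mm_shuffle_epi32(bits, Imm);  // integer-domain shuffle
    return _mm_castsi128_ps(shuffled);                // reinterpret back as floats
}
```

A call such as shuffle_via_pshufd<_MM_SHUFFLE(0, 1, 2, 3)>(v) would reverse the four lanes of v, the same result as _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)).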
