Converting float to uint64 and uint32 behaves strangely
When I convert a 32-bit float to a 64-bit unsigned integer in C++, everything works as expected: an overflow sets the FE_OVERFLOW flag (from cfenv) and the conversion returns 0.
std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint64_t b = a;
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);
But when I convert a 32-bit float to a 32-bit unsigned integer like this:
std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint32_t b = a;
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);
it behaves exactly like the 64-bit conversion, except that the upper 32 bits are truncated. It is equivalent to:
std::feclearexcept(FE_ALL_EXCEPT);
float a = ...;
uint64_t b2 = a;
uint32_t b = b2 & std::numeric_limits<uint32_t>::max();
std::fexcept_t flags;
std::fegetexceptflag(&flags, FE_ALL_EXCEPT);
So the overflow is only reported when the exponent is greater than or equal to 64. For exponents from 32 up to 64, the conversion returns the low 32 bits of the 64-bit result without setting the overflow flag. This is very strange, because you would expect it to overflow at 32.
Is this how it should be or am I doing something wrong?
Compiler: LLVM version 6.0 (clang-600.0.45.3) (based on LLVM 3.5svn)
Overflow in a floating-point-to-integer conversion is undefined behavior. You cannot rely on the conversion being compiled to a single instruction, or to an instruction that overflows for exactly the set of values for which you want the overflow flag to be set.
The assembly instruction cvttsd2si that was probably generated does set flags when it overflows, but the 64-bit version of the instruction may be generated even when converting to a 32-bit integer type. One good reason is the need to truncate a floating-point value to an unsigned 32-bit integer, as in your question: after the 64-bit signed instruction has executed, the low 32 bits of the destination register are correct for every value for which the conversion is defined. There is no unsigned variant of the cvttsd2si instruction.
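The effect described above can be imitated in portable C++ by spelling out the two steps explicitly: convert to a signed 64-bit integer first, then keep the low 32 bits. This is only a sketch of what the generated code appears to do, not the compiler's actual output; the function name wide_then_truncate is hypothetical, and the function assumes the input is finite and smaller than 2^63 in magnitude so that the first conversion is itself defined.

```cpp
#include <cstdint>

// Hypothetical illustration (the name is an assumption, not a real API).
// Precondition: f is finite and |f| < 2^63, so the float -> int64_t
// conversion has defined behavior.
std::uint32_t wide_then_truncate(float f) {
    // Step 1: signed 64-bit conversion, truncating toward zero.
    std::int64_t wide = static_cast<std::int64_t>(f);
    // Step 2: keep only the low 32 bits (integer conversion is modulo 2^32).
    return static_cast<std::uint32_t>(wide);
}
```

For inputs in [0, 2^32) this gives the same result as a direct conversion to uint32_t. For inputs in [2^32, 2^63) the 64-bit step does not overflow, so no flag is set and only the low 32 bits survive, which matches the behavior reported in the question.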
From the Intel manual:
CVTTSD2SI - Convert with Truncation Scalar Double-Precision FP Value to Signed Integer
...
If a converted result exceeds the range limits of signed doubleword integer (in non-64-bit modes or 64-bit mode with REX.W/VEX.W = 0), the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.
If a converted result exceeds the range limits of signed quadword integer (in 64-bit mode and REX.W/VEX.W = 1), the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000_00000000H) is returned.
This blog post, although it is written for C, expands on this.