How are floating point numbers stored internally by the CPU?

I'm getting started with assembly and am fairly comfortable with it. While reading a question I came across this paragraph, where the author explains how floating point numbers are stored in memory.

The exponent for a float is an 8-bit field. To allow both large and small numbers to be stored, the exponent is interpreted as positive or negative: the actual exponent is the value of the 8-bit field minus 127. 127 is the "exponent bias" for 32-bit floating point numbers. A small surprise is the float fraction field. Since 0.0 is defined as all bits set to 0, there is no need to worry about representing 0.0 with an exponent field of 127 and a fraction field of all 0s. All other numbers have at least one 1 bit, so the IEEE 754 format uses an implicit 1 bit to save space. Therefore, if the fraction field is 00000000000000000000000, it is interpreted as 1.00000000000000000000000; this effectively gives the fraction field 24 bits. This clever trick is made possible by reserving the 0x00 and 0xFF exponent fields for special values.

I don't get it at all.

Can you explain to me how they are stored in memory? I don't need links, I just need a good explanation so that I can understand easily.

+4




2 answers


Floating point numbers follow the IEEE 754 standard. This set of rules was chosen mainly because it allows floating point numbers to be compared (relatively) easily with integers and with other floating point numbers.

There are two common versions of floating point numbers: 32-bit (IEEE binary32, also known as float or single precision) and 64-bit (IEEE binary64, also known as double precision). The only difference between the two is the size of their fields:

  • exponent: 8 bits for 32-bit, 11 bits for 64-bit
  • mantissa: 23 bits for 32-bit, 52 bits for 64-bit

There's an extra bit, the sign bit, that indicates whether the number in question is positive or negative.
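
As a concrete illustration (my sketch, not part of the original answer; the variable names and test value are my own choices), here is a minimal C program that unpacks the three fields of a 32-bit float:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = 12.375f;              /* the example value used below */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits); /* reinterpret the bytes, no conversion */

        uint32_t sign     = bits >> 31;          /* 1 bit */
        uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127 */
        uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, implicit leading 1 */

        printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
               sign, exponent, (int)exponent - 127, mantissa);
        return 0;
    }

(memcpy is the well-defined way to reinterpret the bytes of a float in C; a pointer cast would break strict aliasing rules.)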

Now let's take 12.375 in base 10 (as a 32-bit float) as an example:

  • The first step is to convert this number to base 2. This is pretty easy: 12 is 1100 in binary and 0.375 is 0.011 (0.25 + 0.125), so after some calculation you will have 1100.011

  • Then you must move the "point" until you get 1.100011 (until the single digit in front of the point is 1). How many times did we move the point? 3 times, and that count is the exponent. This means that our number can be represented as 1.100011 * 2^3. (It is not called a decimal point here, because this is a binary number; it is a "radix point" or "binary point".)

    Moving the point around (and counting the moves in the exponent) until the mantissa starts with a leading 1 is called "normalizing". A number too small for such a representation (because of the limited range of the exponent) is called a subnormal or denormal number.

  • After that we have to add a bias to the exponent: 127 for the 8-bit exponent field of 32-bit floating point numbers. Why do we do this? Because this way we can more easily compare floating point numbers with integers. (Comparing FP bit patterns as integers shows which one has the larger value, if they have the same sign.) In addition, incrementing the bit pattern (including the carry from the mantissa into the exponent) steps the magnitude up to the next representable value, which is what nextafter() does; see the verification sketch after this example.

    If we didn't add the bias, a negative exponent would have to be represented in two's complement, essentially putting a 1 in the most significant bit. But then a float with a smaller magnitude could look larger, as a bit pattern, than one with a positive exponent. So we just add 127: with this little "trick", all positive exponents start at 10000000 base 2 (encoding an actual exponent of +1), while negative exponents reach at most 01111110 base 2 (encoding an actual exponent of -1).

In our example, the biased exponent is 10000010 base 2 (3 + 127 = 130).

  • The last thing to do is to put the mantissa bits (100011, padded out with zeros) after the exponent. The result is:

    01000001010001100000000000000000
     |  exp ||       mantissa      |

(the first bit is the sign bit)
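
To check the result (again my sketch, not the original answer's; the expected constant 0x41460000 follows from the bit string above), here is a short C program that prints the bit pattern of 12.375f and demonstrates the bit-pattern-increment trick mentioned earlier:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = 12.375f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);

        /* Expect 0x41460000 = 0 10000010 10001100000000000000000 */
        printf("bits of 12.375f: 0x%08X\n", bits);

        /* Incrementing the bit pattern gives the next representable float,
           i.e. what nextafter(f, INFINITY) returns for a positive f. */
        uint32_t next_bits = bits + 1;
        float next;
        memcpy(&next, &next_bits, sizeof next);
        printf("next representable: %.9g\n", next);
        return 0;
    }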



There is a good online converter that renders the bits of a 32-bit floating point number and shows the decimal number it represents. You can change either one and it updates the other: https://www.h-schmidt.net/FloatConverter/IEEE754.html


This was the simple version, which is a good start. To keep it simple, it skipped:

  • Not-a-Number, NaN (biased exponent = all ones; mantissa != 0)
  • +/-Infinity (biased exponent = all ones; mantissa = 0)
  • and it said nothing about subnormal numbers (biased exponent = 0 implies a leading 0 in the mantissa instead of the normal 1).
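
As an aside (my addition, not the original answer's; the test value 1e-45f is an arbitrary pick that rounds to a subnormal), these special encodings are easy to poke at from C:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    static void show(const char *label, float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        printf("%-10s bits=0x%08X exponent-field=0x%02X\n",
               label, bits, (bits >> 23) & 0xFF);
    }

    int main(void) {
        show("NaN",       NAN);      /* exponent all ones, mantissa != 0 */
        show("+Inf",      INFINITY); /* exponent all ones, mantissa == 0 */
        show("subnormal", 1e-45f);   /* exponent field 0, implied leading 0 */
        return 0;
    }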

The Wikipedia articles on single and double precision are excellent, with diagrams and plenty of explanation of the corner cases and details. See them for the full story.

In addition, some (mostly historical) computers use FP formats other than IEEE-754.

And there are other IEEE-754 formats, like 16-bit half precision. One well-known extended precision format is the 80-bit x87 format, which stores the leading 1 of the significand explicitly, instead of implying it from a zero or non-zero exponent field.

IEEE-754 even defines some decimal floating point formats, using 10^exp so that decimal fractions can be represented exactly, instead of binary fractions. (Hardware support for them is limited, but it exists.)

+5




Nothing here is different from grade school math. In school we started with positive integers, learned to add first, then subtract, and so on. Then we learned about the horizontal stroke that represents a minus sign and indicates negative numbers, and we learned about the number line, where before that we couldn't go negative. So the presence or absence of a minus sign (or a minus sign versus a plus sign) indicates whether an individual number is positive or negative. You only need one bit in binary for that: am I negative or positive. This is the "sign" bit in this floating point format (and others).

Then at some point in elementary school, after spending some time on fractions, we learned about decimal points. And that was just a period we put between two digits to indicate where the integer part ended and where the fraction began. I could just stop there and say that base 2 is no different from base 10 or base 13 or base 27: you just put the period between two digits to indicate where the last integer digit is and where the fraction starts. But floating point goes a little further. Maybe it was in elementary school or later in high school, but eventually they taught us scientific notation and/or other ways of representing numbers by moving that decimal point around. The decimal point still represents the border between the last integer digit and the beginning of the fraction, but alongside it we keep a factor: the base raised to a power.

12345.67 = 1.234567 * 10^4

      

And that's the rest of the puzzle. With pencil and paper, given enough paper and enough pencil lead (graphite), we can write numbers with as many digits as we need. But, as you already know from integers, on a computer we are usually limited by the register size. (We can use knowledge from other lessons to turn an 8-bit ALU into an arbitrarily wide ALU, as long as we have enough memory, but we are still dealing with things some fixed number of bits at a time.) In this case they originally chose 32-, 64- and 80-bit formats (or maybe the 80-bit one came later), so our digits are strictly limited to those sizes (we now also have 16-bit, and maybe smaller, though below that it doesn't make much sense), and part of the number is the power-of-the-base factor. The mantissa is the 1.234567 part above, but it is stored without the decimal point, as 1234567; the decimal point's location is assumed/agreed upon (known). It goes right after the first non-zero digit: 123456.7 we move to 1.234567 and adjust the exponent; 78.45 we move to 7.845 and adjust the exponent. Since this is binary, the only non-zero digit value is one (a bit is 0 or 1), so 011101.000 we move to 1.110100 and adjust the exponent. (This is just like scientific notation, but in base 2.)

Then the number of bits in that mantissa (or significant digits, if you want to think of it in scientific notation terms) is limited: 23 bits in this case. See the Wikipedia page on the single precision floating point format (that is the 32-bit one; double precision is 64 bits and works exactly the same way, just with more mantissa and exponent bits). So we take our number, however many digits it has, find the most significant 1, move the point to just after it, and adjust the exponent accordingly, just as we did above:



11101.01 = 1.110101 * 2^4

      

We technically don't need to store the 1 before the binary point, and we don't need to store the base 2, but we do need to store 110101, and we need to store the 4 in binary, along with the sign bit, which in the case above indicates positive. So: sign, exponent and mantissa, and we can reconstruct the number. Or any number that fits, i.e. is not so small or so large that the exponent doesn't fit into the allocated number of bits.

Then the IEEE-754 folks took one final step: instead of just encoding the exponent as a plain signed number, they used something like an inverted two's complement. We already know from integer math on computers what two's complement numbers look like. They didn't do exactly that (which arguably would have made more sense); instead they declared that 1000...0000, a one followed by all zeros, is the midpoint. Or, another way to look at it: all zeros is the smallest exponent and all ones is the largest, and you have to correct for that. Where two's complement in 8 bits runs from +127 down to -128, they shifted it so the encoded range runs from +128 down to -127, giving one more positive exponent. For us it just means we adjust by adding 127. In my example of 2^4: 4 in 8-bit binary is 00000100; to "encode" it into the IEEE 754 single precision format it becomes 10000011. I just added 127 (or, equivalently, added 128 to get 10000100 and then subtracted one).
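
Putting those pieces together (this sketch is my illustration, not the answer's; 11101.01 base 2 is 29.25 in decimal), we can hand-encode the example and compare against what the compiler/CPU actually stores:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        /* Hand-encode 11101.01 (binary) = 1.110101 * 2^4 = 29.25 */
        uint32_t sign     = 0;        /* positive */
        uint32_t exponent = 4 + 127;  /* biased: 10000011 */
        uint32_t mantissa = 0x6A0000; /* 110101 followed by zeros, 23 bits */
        uint32_t encoded  = (sign << 31) | (exponent << 23) | mantissa;

        float f = 29.25f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);

        printf("hand-encoded: 0x%08X\n", encoded); /* 0x41EA0000 */
        printf("actual float: 0x%08X\n", bits);    /* should match */
        return 0;
    }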

So I lied, there are a few more things: special cases. So far we have one bit for the sign (positive or negative), 8 bits for the encoded exponent of the power-of-2 factor, and the mantissa, or fraction, holding the significant bits of our number. But what about zero? There is no non-zero bit in zero, so how do we represent that number? Well, it is a special case, almost a hardcoded value, and you can actually form +0 and -0 with different bit patterns. Later versions of the spec, I think, encourage or dictate that math resulting in zero produces the positive one, but I don't know that for sure; I haven't seen a copy of the specification in many years, since you have to pay for it to get it legally. Other special cases are called NaN, "not a number": these are also special bit patterns known to represent NaNs, and there is more than one NaN, since you can put different patterns in the mantissa. You get these in cases like dividing by zero, or when your number is so large that you cannot represent it as 2 to the power N, because N is too large for the number of bits in the exponent encoding (greater than +128 before single-precision encoding), or when the number is too small (exponent less than -127). Some formats also have numbers called tiny numbers or denormals: numbers that are not 1.xxxx, relaxing the rules to allow 0.000...1xxxx, which is not a normalized format but is just slightly smaller than the smallest normalized number we can represent. Some FPUs/software don't support denormals.
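
For instance (my example, not the answer's), +0 and -0 really do have different bit patterns, even though they compare equal:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float pos = 0.0f, neg = -0.0f;
        uint32_t pbits, nbits;
        memcpy(&pbits, &pos, sizeof pbits);
        memcpy(&nbits, &neg, sizeof nbits);

        printf("+0.0f bits: 0x%08X\n", pbits); /* 0x00000000 */
        printf("-0.0f bits: 0x%08X\n", nbits); /* 0x80000000, only the sign bit set */
        printf("equal? %s\n", pos == neg ? "yes" : "no"); /* yes */
        return 0;
    }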

Now go to Wikipedia and search for "single precision floating point format", and that page should make a lot of sense... hopefully...

+1








