How many bits of precision for a double are between -1.0 and 1.0?

In some of the audio libraries I've looked at, audio samples are often represented as double or float, with a range of -1.0 to 1.0. In some cases, this lets parsing and synthesis code abstract away from whatever the underlying data type might be (signed long int, unsigned char, etc.).

Assuming IEEE 754, the representable values are not evenly spaced: the density increases as a number approaches zero. This means we have less precision for numbers approaching -1 and 1.

This uneven number density doesn't matter if we can represent enough values for the underlying datatype we are converting to / from.

For example, if the underlying datatype was an unsigned char, we only need 256 values from -1 to 1 (or 8 bits) - using double is clearly not a problem.

My question is, how many bits of precision do I have? Is it safe to convert to / from a 32-bit integer without loss? To expand on the question, what range of values would it be safe to convert to / from a 32-bit integer without loss?




2 answers

For IEEE doubles, you have a 53-bit mantissa, which is sufficient to represent 32-bit integers interpreted as fixed-point values between -1 (0x80000000) and 1 - 2^-31 (0x7FFFFFFF).

Floats have 24-bit mantissas, which is not enough.



As Alexandre C. explains, an IEEE double has a 53-bit mantissa (52 bits stored, with the top bit implied) and a float has 24 bits (23 bits stored, with the top bit implied).

Edit: (Thanks for the feedback, I hope this is clearer)

When an integer is converted to a double, for example double f = (double)1024;, the number is stored with the appropriate exponent (1023 + 10), and effectively the same bit pattern as the original integer is stored in the mantissa. (Strictly, IEEE binary floating point does not store the top bit. IEEE floating-point numbers are "normalized" to have the top bit = 1 by adjusting the exponent, and that top 1 is then dropped because it is "implied", which saves a bit of storage.)

A 32-bit integer needs a double to hold its value exactly, while an 8-bit integer fits exactly in a float. No information is lost there: the value can be converted back to an integer without loss. Loss happens with arithmetic and with fractional values.

The integer is not mapped to +/- 1 unless the code does so. When the code divides that 32-bit integer, stored as a double, to map it to the +/- 1 range, an error will most likely be introduced (unless the divisor is a power of two, in which case only the exponent changes).

This mapping to +/- 1 will lose some of the 53-bit precision, but the error will only be in the least significant bits, well below the 32 bits required for the original integer. Subsequent operations may lose further accuracy. For example, multiplying two numbers whose product needs more than 53 bits of precision will lose some bits (i.e. multiplying two numbers that each have more than 27 significant bits in the mantissa).

An explanation of floating point that may be helpful is "What Every Computer Scientist Should Know About Floating-Point Arithmetic". It explains some of the counter-intuitive (to me) behavior of floating-point numbers.

For example, the number 0.1 cannot be stored exactly in an IEEE binary double.

This program can help you see what's going on:

/* Demonstrate IEEE 'double' encoding on x86
 * Show bit patterns and 'printf' output for double values
 * Show error representing 0.1, and accumulated error of adding 0.1 many times
 * G Bulmer 2012
 * Note: lround() may require linking the math library (e.g. -lm).
 */

#include <math.h>   /* lround */
#include <stdio.h>

/* Field order assumes a little-endian target that packs these
 * bit-fields into one 64-bit unit (as GCC does on x86). */
typedef struct {
    unsigned long long mantissa :52;
    unsigned exponent :11;
    unsigned sign :1;
} double_bits;
const unsigned exponent_offset = 1023;

typedef union { double d; unsigned long long l; double_bits b; } Xlate;

void print_xlate(Xlate val) {
    const unsigned long long IMPLIED = (1ULL<<52);  /* the implied top bit */
    if (val.b.exponent == 0) { /* zero (or subnormal)? */
        printf("val: d: %19lf  bits: %016llX [sign: %u exponent: zero=%u mantissa: %llX]\n",
               val.d, val.l, (unsigned)val.b.sign, (unsigned)val.b.exponent,
               (unsigned long long)val.b.mantissa);
    } else {
        printf("val: d: %19lf  bits: %016llX [sign: %u exponent: 2^%-4d mantissa: %llX]\n",
               val.d, val.l, (unsigned)val.b.sign,
               (int)val.b.exponent - (int)exponent_offset,
               (unsigned long long)val.b.mantissa | IMPLIED);
    }
}

double add_many(double d, int many) {
    double accum = 0.0;
    while (many-- > 0) {    /* only works for +d */
        accum += d;
    }
    return accum;
}

int main(void) {
    Xlate val;
    /* Build 2.0 field by field */
    val.b.sign = 0;
    val.b.exponent = exponent_offset+1;
    val.b.mantissa = 0;                 print_xlate(val);

    val.d = 1.0;                        print_xlate(val);

    val.d = 0.0;                        print_xlate(val);

    val.d = -1.0;                       print_xlate(val);

    val.d = 3.0;                        print_xlate(val);

    val.d = 7.0;                        print_xlate(val);

    val.d = (double)((1LL<<31)-1LL);    print_xlate(val);

    val.d = 2147483647.0;               print_xlate(val);

    val.d = 10000.0;                    print_xlate(val);

    val.d = 100000.0;                   print_xlate(val);

    val.d = 1000000.0;                  print_xlate(val);

    val.d = 0.1;                        print_xlate(val);

    val.d = add_many(0.1, 100000);      print_xlate(val);

    val.d = add_many(0.1, 1000000);     print_xlate(val);

    val.d = add_many(0.1, 10000000);    print_xlate(val);

    val.d = add_many(0.1,10);           print_xlate(val);
    val.d *= 2147483647.0;              print_xlate(val);
    int i = (int)val.d;                 printf("int i=truncate(d)=%d\n", i);
    int j = (int)lround(val.d);         printf("int i=lround(d)=%d\n", j);

    val.d = add_many(0.0001,1000)-0.1;  print_xlate(val);

    return 0;
}



val: d:            2.000000  bits: 4000000000000000 [sign: 0 exponent: 2^1    mantissa: 10000000000000]
val: d:            1.000000  bits: 3FF0000000000000 [sign: 0 exponent: 2^0    mantissa: 10000000000000]
val: d:            0.000000  bits: 0000000000000000 [sign: 0 exponent: zero=0 mantissa: 0]
val: d:           -1.000000  bits: BFF0000000000000 [sign: 1 exponent: 2^0    mantissa: 10000000000000]
val: d:            3.000000  bits: 4008000000000000 [sign: 0 exponent: 2^1    mantissa: 18000000000000]
val: d:            7.000000  bits: 401C000000000000 [sign: 0 exponent: 2^2    mantissa: 1C000000000000]
val: d:   2147483647.000000  bits: 41DFFFFFFFC00000 [sign: 0 exponent: 2^30   mantissa: 1FFFFFFFC00000]
val: d:   2147483647.000000  bits: 41DFFFFFFFC00000 [sign: 0 exponent: 2^30   mantissa: 1FFFFFFFC00000]
val: d:        10000.000000  bits: 40C3880000000000 [sign: 0 exponent: 2^13   mantissa: 13880000000000]
val: d:       100000.000000  bits: 40F86A0000000000 [sign: 0 exponent: 2^16   mantissa: 186A0000000000]
val: d:      1000000.000000  bits: 412E848000000000 [sign: 0 exponent: 2^19   mantissa: 1E848000000000]
val: d:            0.100000  bits: 3FB999999999999A [sign: 0 exponent: 2^-4   mantissa: 1999999999999A]
val: d:        10000.000000  bits: 40C388000000287A [sign: 0 exponent: 2^13   mantissa: 1388000000287A]
val: d:       100000.000001  bits: 40F86A00000165CB [sign: 0 exponent: 2^16   mantissa: 186A00000165CB]
val: d:       999999.999839  bits: 412E847FFFEAE4E9 [sign: 0 exponent: 2^19   mantissa: 1E847FFFEAE4E9]
val: d:            1.000000  bits: 3FEFFFFFFFFFFFFF [sign: 0 exponent: 2^-1   mantissa: 1FFFFFFFFFFFFF]
val: d:   2147483647.000000  bits: 41DFFFFFFFBFFFFF [sign: 0 exponent: 2^30   mantissa: 1FFFFFFFBFFFFF]
int i=truncate(d)=2147483646
int i=lround(d)=2147483647
val: d:            0.000000  bits: 3CE0800000000000 [sign: 0 exponent: 2^-49  mantissa: 10800000000000]


This shows that a full 32-bit int is represented exactly, but 0.1 is not. It shows that printf does not print a floating-point number exactly, but rounds or truncates it (something to be wary of). It also illustrates that the error in this representation of 0.1 accumulates slowly: after 100,000 additions it is still too small for printf to show, and it only becomes visible in the sixth decimal place after 1,000,000 additions. It shows that the original integer can be recovered by rounding, but not by plain assignment, because assignment truncates. Finally, it shows that subtraction can "amplify" the error (all that is left after the subtraction is error), so arithmetic must be analyzed carefully.

To put this in a music context, where the sample rate can be 96 kHz: it would take more than 10 seconds of additions before the error built up enough for the top 32 bits to contain more than 1 bit of error.

Further, Christopher "Monty" Montgomery, creator of Ogg and Vorbis, argues that 24 bits should be more than enough for audio in an article on music, sample rate and sample resolution: "24/192 Music Downloads ...and why they make no sense".

Summary
A double holds 32-bit integers exactly. There are rational numbers of the form N/M (where M and N can each be represented by a 32-bit integer) that cannot be represented by a finite binary fraction. So when an integer is mapped to the +/- 1 range, and is thereby converted to a rational number (N/M), some values cannot be represented in the finite number of bits of a double's fractional part, and errors creep in.

These errors are generally very small, in the least significant bits, and hence well below the upper 32 bits. So values can be converted back and forth between integer and double using rounding, and the representation error of the double will not produce a wrong integer. BUT arithmetic can change the error. Badly constructed arithmetic can make the error grow rapidly, to the point where the original integer value is corrupted.

Other thoughts: if precision is important, there are other ways to use doubles. None of them is as convenient as mapping to +/- 1. All the ones I can think of require keeping track of the arithmetic operations, which is best done with C++ wrapper classes. This would slow computation down drastically, so it may be pointless.

There is a very interesting way of doing "Automatic Differentiation" by wrapping arithmetic in classes that keep track of additional information. I think the ideas there could inspire an approach. It might even help identify where precision is being lost.


