Why is the average in this array greater than the maximum?

I have a very confusing array in Python. Below is the output from IPython (started with the --pylab flag) when I work with it:

In [1]: x = np.load('x.npy')

In [2]: x.shape
Out[2]: (504000,)

In [3]: x
Out[3]: 
array([ 98.20354462,  98.26583099,  98.26529694, ...,  98.20297241,
        98.19876862,  98.29492188], dtype=float32)

In [4]: min(x), mean(x), max(x)
Out[4]: (97.950058, 98.689438, 98.329773)

      

I have no idea what's going on. Why does the mean() function give an obviously wrong answer?

I don't even know where to start debugging this problem.

I am using Python 2.7.6.


I can share the .npy file if needed.

+3




2 answers


This is probably due to accumulated round-off error when computing mean(). The relative precision of float32 is ~1e-7 and you have ~500000 elements, so summing them naively, one after another, can accumulate a relative error of up to ~5%.
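To make the accumulation effect concrete, here is a minimal sketch that sums a float32 array one element at a time, the way a naive loop would. The array contents (504000 copies of 98.2) are an assumption chosen to resemble the question's data, not the actual x.npy file. Once the running total grows past 2**25, float32 can only represent multiples of 4 there, so each added 98.2 is effectively rounded to 100 and the mean drifts above the maximum.

```python
import numpy as np

# Hypothetical stand-in for the question's data: constant values near 98.2.
x = np.full(504000, 98.2, dtype=np.float32)

# Naive sequential summation, keeping the running total in float32.
naive_total = np.float32(0.0)
for value in x:
    naive_total += value  # float32 + float32 stays float32

naive_mean = naive_total / x.size

print("max:       ", x.max())
print("naive mean:", naive_mean)  # ends up larger than the maximum
```

The exact drift depends on the data and on the order of summation, so treat the printed numbers as illustrative only.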

As of NumPy 1.9.0, sum() and mean() use a more accurate algorithm (pairwise summation):



>>> import numpy
>>> numpy.__version__
'1.9.0'
>>> x = numpy.random.random(500000).astype("float32") + 300
>>> min(x), numpy.mean(x), max(x)
(300.0, 300.50024, 301.0)
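For illustration, the idea behind pairwise summation can be sketched as follows. This is a simplified recursive version, not NumPy's actual implementation; the function name pairwise_sum and the block size of 128 are assumptions made for this example. Splitting the array in half and adding the two partial sums keeps the operands at similar magnitudes, so the rounding error grows roughly with log(n) rather than with n.

```python
import numpy as np

def pairwise_sum(values, block=128):
    """Sum a 1-D float32 array by recursive halving (illustrative sketch)."""
    if values.size <= block:
        # Small blocks are summed naively; the error here stays tiny.
        total = np.float32(0.0)
        for v in values:
            total += v
        return total
    mid = values.size // 2
    # Add two partial sums of comparable magnitude, still in float32.
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)

# Hypothetical stand-in for the question's data.
x = np.full(504000, 98.2, dtype=np.float32)
pairwise_mean = pairwise_sum(x) / np.float32(x.size)
print(pairwise_mean)  # stays close to 98.2 even though everything is float32
```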

      

Alternatively, you can use a higher-precision accumulator type: numpy.mean(x, dtype=numpy.float64)
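A short sketch of the float64 accumulator in use, with a sanity check that the result lies between the minimum and the maximum. The random data is a hypothetical stand-in for x.npy, and numpy.random.default_rng assumes a reasonably recent NumPy (1.17 or later).

```python
import numpy as np

# Hypothetical float32 data in roughly the same range as the question's array.
rng = np.random.default_rng(0)
x = (98.2 + 0.1 * rng.random(504000)).astype(np.float32)

# Accumulate in float64 while leaving the stored array as float32.
mean64 = np.mean(x, dtype=np.float64)

# A mean computed accurately must lie within [min, max].
print(x.min(), mean64, x.max())
```

Passing dtype only changes the precision of the accumulation; it does not copy or convert the array itself, which is why it is usually preferred over x.astype(numpy.float64).mean() for large arrays.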

+7




I've included a snippet from np.mean.__doc__ below. You should try using np.mean(x, dtype=np.float64).



-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.

Note that for floating-point input, the mean is computed using the
same precision the input has.  Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below).  Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.

In single precision, `mean` can be inaccurate:

>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.546875

Computing the mean in float64 is more accurate:

>>> np.mean(a, dtype=np.float64)
0.55000000074505806

      

+3

