Gaussian kernel density estimate (KDE) for large numbers in Python

I have 1000 large numbers randomly distributed in the range 37231 to 56661.

I am trying to use stats.gaussian_kde

but something is not working. (perhaps because of my poor knowledge of statistics?).

Here is the code:

from scipy import stats.gaussian_kde
import matplotlib.pyplot as plt

# 'data' is a 1D array that contains the initial numbers 37231 to 56661
xmin = min(data)
xmax = max(data)   

# get evenly distributed numbers for X axis.
x = linspace(xmin, xmax, 1000)   # get 1000 points on x axis
nPoints = len(x)

# get actual kernel density.
density = gaussian_kde(data)
y = density(x)

# print the output data
for i in range(nPoints):
    print "%s   %s" % (x[i], y[i])

plt.plot(x, density(x))
plt.show()

      

In the printout, I get x values ​​in column 1 and zeros in column 2. The graph shows a flat line.

I just can't seem to find a solution. I tried a very wide range of X-es, same result.

What is the problem? What am I doing wrong? Could large numbers be the cause?

+3


source to share


2 answers


I think what is happening is that your data array consists of integers, which leads to problems:

>>> import numpy, scipy.stats
>>> 
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>> 
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
        44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

      



but if we use float:

>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605.,  45451.,  46029.,  40379.,  48885.,  41262.,  39248.,
         38247.,  55987.,  44019.]])
>>> density(x)
array([  4.42201513e-05,   5.51130237e-05,   5.94470211e-05,
         5.78485526e-05,   5.21379448e-05,   4.43176188e-05,
         3.66725694e-05,   3.06297511e-05,   2.56191024e-05,
         2.01305127e-05])

      

+6


source


I made a function for this. You can change the bandwidth as a parameter to the function. That is, smaller number = more pointy, larger number = smoother. The default is 0.3.

He works in IPython notebook --pylab=inline



The number of bins is optimized and coded so will vary based on the number of variables in your data.

import scipy.stats as stats
import numpy as np

def hist_with_kde(data, bandwidth = 0.3):
    #set number of bins using Freedman and Diaconis
    q1 = np.percentile(data,25)
    q3 = np.percentile(data,75)


    n = len(data)**(.1/.3)
    rng = max(data) - min(data)
    iqr = 2*(q3-q1)
    bins = int((n*rng)/iqr)

    x = linspace(min(data),max(data),200)

    kde = stats.gaussian_kde(data)
    kde.covariance_factor = lambda : bandwidth
    kde._compute_covariance()

    plot(x,kde(x),'r') # distribution function
    hist(data,bins=bins,normed=True) # histogram

data = np.random.randn(500)
hist_with_kde(data,0.25)

      

+1


source







All Articles