Gaussian kernel density estimate (KDE) for large numbers in Python

Question

I have 1000 large numbers randomly distributed in the range 37231 to 56661.

I am trying to use stats.gaussian_kde, but something is not working (perhaps because of my poor knowledge of statistics?).

Here is the code:

import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# 'data' is a 1D array that contains the initial numbers 37231 to 56661
xmin = min(data)
xmax = max(data)

# get evenly spaced points for the x axis
x = np.linspace(xmin, xmax, 1000)   # 1000 points on the x axis
nPoints = len(x)

# evaluate the kernel density estimate
density = gaussian_kde(data)
y = density(x)

# print the output data
for i in range(nPoints):
    print("%s   %s" % (x[i], y[i]))

plt.plot(x, y)
plt.show()

In the printout, I get x values in column 1 and zeros in column 2. The graph shows a flat line.

I just can't seem to find a solution. I tried a very wide range of x values, with the same result.

What is the problem? What am I doing wrong? Could large numbers be the cause?

+3

python scipy matplotlib statistics

Proteos 21 Mar At 11:37 pm


2 answers

I made a function for this. You can change the bandwidth by passing it as a parameter: a smaller number gives a more pointy curve, a larger number a smoother one. The default is 0.3.

It works in an IPython notebook started with --pylab=inline.

The number of bins is computed with the Freedman-Diaconis rule, so it varies with the number of observations in your data.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

def hist_with_kde(data, bandwidth=0.3):
    # Set the number of bins using the Freedman-Diaconis rule:
    # bin width = 2 * IQR / n**(1/3), so bins = range / bin width.
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)

    n = len(data) ** (1.0 / 3.0)
    rng = max(data) - min(data)
    iqr = 2 * (q3 - q1)
    bins = int((n * rng) / iqr)

    x = np.linspace(min(data), max(data), 200)

    # Override the default bandwidth factor with the one passed in
    kde = stats.gaussian_kde(data)
    kde.covariance_factor = lambda: bandwidth
    kde._compute_covariance()

    plt.plot(x, kde(x), 'r')  # kernel density estimate
    plt.hist(data, bins=bins, density=True)  # normalized histogram

data = np.random.randn(500)
hist_with_kde(data, 0.25)
plt.show()
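
To see the effect of the bandwidth without plotting, here is a minimal sketch that compares two bandwidth factors on the same sample. It is not part of the function above: it assumes the `bw_method` argument of `gaussian_kde` (a scalar is used directly as the bandwidth factor), and the seed and the two factors 0.1 and 1.0 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative sample; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
data = rng.standard_normal(500)
x = np.linspace(data.min(), data.max(), 200)

# A scalar bw_method is used directly as the bandwidth factor.
narrow = gaussian_kde(data, bw_method=0.1)(x)  # small factor: pointier curve
wide = gaussian_kde(data, bw_method=1.0)(x)    # large factor: smoother curve

# The smoother estimate spreads the mass out, so its peak is lower.
print(narrow.max(), wide.max())
```

The smaller factor follows the sample more closely and produces a higher, spikier peak, while the larger one flattens the curve.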

+1

John Apr 26. 13 at 9:08 am


DSM · Accepted Answer · 2012-03-21T23:59:09+0000

I think what is happening is that your data array consists of integers, which leads to problems:

>>> import numpy, scipy.stats
>>> 
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>> 
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
        44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

But if we convert the data to float:

>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605.,  45451.,  46029.,  40379.,  48885.,  41262.,  39248.,
         38247.,  55987.,  44019.]])
>>> density(x)
array([  4.42201513e-05,   5.51130237e-05,   5.94470211e-05,
         5.78485526e-05,   5.21379448e-05,   4.43176188e-05,
         3.66725694e-05,   3.06297511e-05,   2.56191024e-05,
         2.01305127e-05])
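
Applied to the code in the question, the fix might look like the sketch below. The `data` array here is a made-up stand-in for the asker's numbers, and the conversion uses `np.asarray(..., dtype=float)` rather than `data*1.0`, which has the same effect. Newer SciPy releases may already handle integer input, in which case the explicit conversion is simply harmless. The sketch also checks that the estimate integrates to about 1, which is why the individual density values are so small for data spread over a range of roughly 20000.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in for the asker's data: 1000 integers in [37231, 56661).
rng = np.random.default_rng(0)
data = rng.integers(37231, 56661, size=1000)

# Convert to float before building the estimator.
data = np.asarray(data, dtype=float)

x = np.linspace(data.min(), data.max(), 1000)
density = gaussian_kde(data)
y = density(x)

# Total probability over a range comfortably wider than the data.
total = density.integrate_box_1d(data.min() - 20000, data.max() + 20000)

print(y.max(), total)
```

Density values on the order of 1e-5 are expected here: a density that integrates to 1 over a width of about 20000 has to average about 1/20000, so the large numbers are not a problem in themselves.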
