# Gaussian kernel density estimate (KDE) for large numbers in Python

I have 1000 large numbers randomly distributed in the range 37231 to 56661.

I am trying to use `stats.gaussian_kde`

but something is not working. (perhaps because of my poor knowledge of statistics?).

Here is the code:

``````from scipy import stats.gaussian_kde
import matplotlib.pyplot as plt

# 'data' is a 1D array that contains the initial numbers 37231 to 56661
xmin = min(data)
xmax = max(data)

# get evenly distributed numbers for X axis.
x = linspace(xmin, xmax, 1000)   # get 1000 points on x axis
nPoints = len(x)

# get actual kernel density.
density = gaussian_kde(data)
y = density(x)

# print the output data
for i in range(nPoints):
print "%s   %s" % (x[i], y[i])

plt.plot(x, density(x))
plt.show()
```

```

In the printout, I get x values ​​in column 1 and zeros in column 2. The graph shows a flat line.

I just can't seem to find a solution. I tried a very wide range of X-es, same result.

What is the problem? What am I doing wrong? Could large numbers be the cause?

+3

source to share

I think what is happening is that your data array consists of integers, which leads to problems:

``````>>> import numpy, scipy.stats
>>>
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>>
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

```

but if we use float:

``````>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605.,  45451.,  46029.,  40379.,  48885.,  41262.,  39248.,
38247.,  55987.,  44019.]])
>>> density(x)
array([  4.42201513e-05,   5.51130237e-05,   5.94470211e-05,
5.78485526e-05,   5.21379448e-05,   4.43176188e-05,
3.66725694e-05,   3.06297511e-05,   2.56191024e-05,
2.01305127e-05])
```

```
+6

source

I made a function for this. You can change the bandwidth as a parameter to the function. That is, smaller number = more pointy, larger number = smoother. The default is 0.3.

He works in `IPython notebook --pylab=inline`

The number of bins is optimized and coded so will vary based on the number of variables in your data.

``````import scipy.stats as stats
import numpy as np

def hist_with_kde(data, bandwidth = 0.3):
#set number of bins using Freedman and Diaconis
q1 = np.percentile(data,25)
q3 = np.percentile(data,75)

n = len(data)**(.1/.3)
rng = max(data) - min(data)
iqr = 2*(q3-q1)
bins = int((n*rng)/iqr)

x = linspace(min(data),max(data),200)

kde = stats.gaussian_kde(data)
kde.covariance_factor = lambda : bandwidth
kde._compute_covariance()

plot(x,kde(x),'r') # distribution function
hist(data,bins=bins,normed=True) # histogram

data = np.random.randn(500)
hist_with_kde(data,0.25)
```

```
+1

source

All Articles