How to estimate the density function and calculate its peaks?

I recently started using Python for parsing data. I would like to do the following:

  • Get the distribution of a dataset
  • Find the peaks of that distribution

I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any kind of assumption about the data? I am working with data that changes over time, so even if the data follows one distribution now (e.g. Gaussian), it may follow a different distribution later. Does gaussian_kde have any drawbacks in this scenario? In a linked question it was suggested to fit the data against each candidate distribution to find the one that matches best. So what is the difference between using gaussian_kde and the approach in that question? I used the code below, and I am also curious whether gaussian_kde is a good way to estimate the pdf when the data changes over time. One advantage of gaussian_kde that I know of is that it automatically calculates the bandwidth using a rule of thumb. Also, how can I get its peaks?

import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats

df = pd.read_csv(r'D:\dataset.csv')        # raw string so '\d' is not treated as an escape
data = df.iloc[:, 0].values                # gaussian_kde expects a 1-D array, not a DataFrame

pdf = scipy.stats.gaussian_kde(data)
x = np.linspace(data.min() - 1, data.max() + 1, len(data))
y = pdf(x)

pl.plot(x, y, color='r')                   # estimated density curve
pl.hist(data, density=True)                # histogram on the same (normalized) scale
pl.show(block=True)
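
For reference on the bandwidth point: as far as I understand SciPy's documented gaussian_kde API, the rule of thumb (Scott's rule by default) can be inspected or overridden, for example:

import numpy as np
from scipy.stats import gaussian_kde

data = np.random.randn(500)                      # stand-in for a real data column
kde = gaussian_kde(data)                         # default bandwidth: Scott's rule
print(kde.factor)                                # the bandwidth factor actually used

kde_silverman = gaussian_kde(data, bw_method='silverman')   # Silverman's rule
kde_fixed = gaussian_kde(data, bw_method=0.2)               # explicit scalar factor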



1 answer


I think you need to distinguish the nonparametric density (the one implemented in scipy.stats.kde) from the parametric density (the one mentioned in the fooobar.com/questions/89887/... question). To illustrate the difference between the two, try the following code.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])

df = pd.DataFrame(gaussian_mixture, columns=['data'])

# non-parametric pdf
kde = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = kde(x)

# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
plt.show()


[Figure: histogram of the mixture with the nonparametric KDE (solid red) and the parametric normal fit (dashed black) overlaid]



The graph shows that the nonparametric density is nothing more than a smoothed version of the histogram. In a histogram, a particular observation x = x0 is represented by a bar: the entire probability mass is put at the single point x = x0 and zero elsewhere. In a nonparametric density estimate, that same point is instead represented by a bell-shaped curve (a Gaussian kernel) that spreads its mass over the surrounding neighborhood; averaging these curves over all observations yields a smooth density estimate. This internal Gaussian kernel has nothing to do with any assumption about the distribution of the underlying data x. Its only purpose is smoothing.
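
To make the "one bell curve per observation" idea concrete, here is a minimal sketch that rebuilds the KDE by hand as an average of Gaussian bumps and checks it against gaussian_kde (it assumes the default Scott's-rule bandwidth, so the kernel standard deviation is kde.factor times the sample standard deviation):

import numpy as np
from scipy.stats import norm, gaussian_kde

pts = np.array([-2.0, 0.5, 1.0])            # three toy observations
kde = gaussian_kde(pts)                     # default Scott's-rule bandwidth
h = kde.factor * pts.std(ddof=1)            # kernel standard deviation

grid = np.linspace(-6.0, 5.0, 400)
# one Gaussian bump centred at each observation, averaged into a single curve
manual = norm.pdf(grid[:, None], loc=pts, scale=h).mean(axis=1)

print(np.allclose(manual, kde(grid)))       # True: the two curves coincide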

To get the mode of a nonparametric density, we need to do an exhaustive search over the evaluation grid, since the density is not guaranteed to have a single mode. As the example above shows, if you start a quasi-Newton optimization from a point in [5, 10], it will most likely end up at a local optimum rather than the global one.

# get the global mode: exhaustive search over the grid
x[np.argmax(nparam_density)]
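
If you want every local peak rather than just the global mode, one option is to run scipy.signal.find_peaks (available in SciPy 1.1+) over the evaluated density; the variable names follow the code above:

from scipy.signal import find_peaks

peak_idx, _ = find_peaks(nparam_density)    # indices of all local maxima on the grid
print(x[peak_idx])                          # peak locations (about -6 and 4 here)
print(nparam_density[peak_idx])             # density values at those peaks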

