Python QQ and PP plot of two distributions of unequal length

I'm not sure what the best / most statistically sound way to accomplish this is, but I'm basically trying to take a p-value distribution and compare it to the much larger p-value distribution generated by permuting my original data. I am working with small p-values, so I am actually comparing the log10 of the p-values.

I am trying to find a good general way to compare two arrays that cover the same range of values but have unequal lengths. I really want something like scipy.qqplot(dataset1, dataset2), but it doesn't exist: the existing QQ plot compares your distribution to a fitted theoretical distribution (this question has also been asked for R: https://stats.stackexchange.com/questions/12392/how-to-compare-two-datasets-with-qq-plot-using-ggplot2 ).

It essentially boils down to comparing two histograms. I can use np.linspace to force the same bins for each distribution:

import numpy as np

bins = 100
mx = max(np.max(vector1), np.max(vector2))
mn = min(np.min(vector1), np.min(vector2))
# bins + 1 edges give exactly `bins` bins
boundaries = np.linspace(mn, mx, bins + 1, endpoint=True)
# Label each bin by its midpoint
labels = [(boundaries[i]+boundaries[i+1])/2 for i in range(len(boundaries)-1)]

      

I can then easily use those boundaries and labels to create two histograms, weighted by the length of the original vectors. The simplest approach is to just use several bins and plot them as histograms on the same axis, as in this question:

However, I really want something more like a QQ plot, and I want to use a lot of bins so that I can see even small deviations from the 1-to-1 line. The problem with just plotting two histograms is that they look like this:

histogram_example

The two plots are right on top of each other, so I can't see anything.

So I want to figure out how to compare these two histograms while keeping the bin labels. I can easily plot the two against each other as a scatter plot, but that ends up being indexed by the bin frequency:
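For reference, the length-weighted histograms on shared bins described above can be sketched roughly as follows. The arrays are synthetic stand-ins (an assumption, not the real p-value data), and weighting each observation by 1/len is just one way to put the two samples on the same scale.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the observed and permuted values (assumption)
vector1 = rng.exponential(size=500)
vector2 = rng.exponential(size=50_000)

bins = 100
# Shared edges spanning both samples; bins + 1 edges give `bins` bins
edges = np.linspace(min(vector1.min(), vector2.min()),
                    max(vector1.max(), vector2.max()), bins + 1)

# Weight every observation by 1/len so each histogram sums to 1
h1, _ = np.histogram(vector1, bins=edges,
                     weights=np.full(vector1.size, 1 / vector1.size))
h2, _ = np.histogram(vector2, bins=edges,
                     weights=np.full(vector2.size, 1 / vector2.size))

# Bin midpoints, usable as x-axis labels
centers = (edges[:-1] + edges[1:]) / 2
```

Plotting `h1 - h2` against `centers` instead of overlaying the two histograms is one way to make small differences visible while keeping the bin labels.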

definitely wrong

What I really want is to just compare two histograms or make a QQ plot of the differences, but I can't think of a good statistically valid way to do this. I cannot find any method that would let me make a QQ plot with two datasets instead of one dataset and a built-in theoretical distribution, and I cannot find a way to plot two distributions of unequal length against each other.

For your reference, here are the two histograms that went into creating this plot. You can see that they are very similar:

histograms

I know there must be a good way to do this because it seems so obvious, but I'm new to this kind of thing and relatively new to scipy, pandas and statsmodels as well.

I intentionally did not provide example distributions here because I was not sure how to create a minimal set of arrays that captured what I am trying to do; plus, the point is to be able to do this for any two overlapping arrays of unequal length.

What I want to know is: what is the correct / best way to approach this problem in Python in a statistically valid way? Is there a way to create a distribution from the permutation data that could be used with the statsmodels or scipy QQ plots? Is there an existing way to compare two histograms visually like this? Is there a way to make probability plots that I am not aware of?


Edit: using cumulative and manual QQ plots

Thanks to @user333700's answer, I figured out how to create a manual QQ plot for the data as well as a cumulative probability plot. I created the plots using data with overlapping min / max but different distributions, shown here:

manufactured distributions

QQ plot:

import numpy as np
import matplotlib.pyplot as plt

q = np.linspace(0, 100, 101)  # same percentiles for both samples
fig, ax = plt.subplots()
ax.scatter(np.percentile(ytest, q), np.percentile(xtest, q))

      

qqplot

So this works well with simple data. The cumulative plot is made like this:

import pandas as pd

# Pick bins: every k-th sorted value of x becomes a boundary
bins = 100  # same value as in the first snippet
x = ytest
y = xtest
boundaries = sorted(x)[::round(len(x)/bins)+1]
labels = [(boundaries[i]+boundaries[i+1])/2 for i in range(len(boundaries)-1)]

# Bin two series into equal bins
xb = pd.cut(x, bins=boundaries, labels=labels)
yb = pd.cut(y, bins=boundaries, labels=labels)

# Get value counts for each bin, sort by bin, and normalise by sample size
xhist = xb.value_counts().sort_index(ascending=True)/len(xb)
yhist = yb.value_counts().sort_index(ascending=True)/len(yb)

# Make cumulative (Series.iteritems() was removed in pandas 2.0; cumsum() does the same job)
xhist = xhist.cumsum()
yhist = yhist.cumsum()

# Plot it
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(xhist, yhist)
plt.show()

      

cumulative plot

Going back to my actual data (where the two distributions are extremely similar in every way except for their lengths) and adding a 1-to-1 line, I get this for the two plots:

graphs with real data

So both work, which is great. The cumulative probability plot shows quite clearly that there isn't much difference in the data, but the QQ plot shows that there is a small difference in the tail.



1 answer


In terms of statistical tests, scipy has the two-sample Kolmogorov-Smirnov test for continuous variables. Histogram data with bins could be used with a chi-square test. scipy.stats also has a k-sample Anderson-Darling test.

For plotting:
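A rough sketch of how these tests might be called on two samples of unequal length. The data is synthetic, and two choices here are assumptions rather than anything the answer specifies: the binned comparison uses `scipy.stats.chi2_contingency` (a chi-square test of homogeneity), and the bins are taken from pooled quantiles so that no bin is nearly empty.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
observed = rng.normal(size=300)      # synthetic stand-in (assumption)
permuted = rng.normal(size=30_000)   # synthetic stand-in (assumption)

# Two-sample Kolmogorov-Smirnov test; sample sizes may differ
ks = stats.ks_2samp(observed, permuted)

# k-sample Anderson-Darling test, here with k = 2
ad = stats.anderson_ksamp([observed, permuted])

# Binned comparison: 10 bins with roughly equal pooled counts
pooled = np.concatenate([observed, permuted])
edges = np.quantile(pooled, np.linspace(0, 1, 11))
counts = np.vstack([np.histogram(observed, bins=edges)[0],
                    np.histogram(permuted, bins=edges)[0]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(counts)

print(ks.statistic, ks.pvalue)
print(ad.statistic)
print(chi2, chi2_p, dof)
```

With very many fine bins the expected counts in the smaller sample get small and the chi-square approximation becomes unreliable, which is why this sketch keeps the bin count low.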



The equivalent of a probability plot for two histograms would be to plot the cumulative frequencies for the two samples, i.e. with the cumulative probabilities on each axis corresponding to the bin boundaries.

statsmodels has a QQ plot to compare two samples; however, it currently assumes that the sample sizes are the same. If the sample sizes differ, then the quantiles need to be computed for the same probabilities. https://github.com/statsmodels/statsmodels/issues/2896 https://github.com/statsmodels/statsmodels/pull/3169 (I don't remember what the status of this is.)
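One way to get those cumulative probabilities is to evaluate each sample's empirical CDF at a shared set of bin boundaries. This is a minimal sketch on synthetic data; `ecdf_at` is a hypothetical helper name, not a library function.

```python
import numpy as np

def ecdf_at(sample, points):
    """Fraction of `sample` that is <= each value in `points`."""
    s = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(s, points, side="right") / s.size

rng = np.random.default_rng(1)
small = rng.normal(size=400)      # synthetic stand-in (assumption)
large = rng.normal(size=40_000)   # synthetic stand-in (assumption)

# Shared boundaries spanning both samples
edges = np.linspace(min(small.min(), large.min()),
                    max(small.max(), large.max()), 201)

# Cumulative probability of each sample at every boundary
p_small = ecdf_at(small, edges)
p_large = ecdf_at(large, edges)

# Largest vertical gap between the two curves on this grid
max_gap = np.max(np.abs(p_small - p_large))
```

Scattering `p_large` against `p_small` with a 1-to-1 reference line gives the cumulative plot; `max_gap` is the two-sample Kolmogorov-Smirnov distance restricted to the chosen grid, so it can only understate the exact statistic.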
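Computing the quantiles of both samples at the same probabilities can be sketched with plain NumPy. `qq_points` is a hypothetical helper, and defaulting to as many quantiles as the smaller sample has points is an assumption, not something statsmodels prescribes.

```python
import numpy as np

def qq_points(a, b, n_quantiles=None):
    """Quantiles of two samples of possibly different length,
    evaluated at one common set of probabilities."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if n_quantiles is None:
        # Assumption: use as many points as the smaller sample has
        n_quantiles = min(a.size, b.size)
    # Mid-point plotting positions, strictly inside (0, 1)
    probs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(2)
observed = rng.normal(size=250)      # synthetic stand-in (assumption)
permuted = rng.normal(size=25_000)   # synthetic stand-in (assumption)

q_obs, q_perm = qq_points(observed, permuted)
```

Scattering `q_obs` against `q_perm` and adding a 1-to-1 line gives a two-sample QQ plot; because the plot is in data units rather than probabilities, deviations in the tails stand out more than they do in a cumulative probability plot.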
