How does scipy.stats handle nans?

I am trying to do some statistics in Python. I have data with several missing values filled in with np.nan, and I'm not sure whether I should remove them manually or whether scipy can handle them. So I tried both:

import scipy.stats
import numpy as np

a = [0.75, np.nan, 0.58337, 0.75, 0.75, 0.91663, 1.0, np.nan, 0.663, 0.837,
     0.837, 1.0, 0.663, 1.0, 1.0, 0.91663, 0.75, 0.41669, 0.58337, 0.663,
     0.75, 0.58337]
b = [0.837, np.nan, 0.663, 0.58337, 0.75, 0.75, 0.58337, np.nan, 0.166, 0.5,
     0.663, 1.0, 0.91663, 1.0, 0.663, 0.75, 0.75, 0.41669, 0.331, 0.25,
     1.0, 0.91663]

# 1) pass the data with the nans left in
wilc1 = scipy.stats.wilcoxon(a, b, zero_method='pratt')

# 2) drop every pair that contains a nan, then run the test again
d_1, d_2 = [], []
for d1, d2 in zip(a, b):
    if not (np.isnan(d1) or np.isnan(d2)):
        d_1.append(d1)
        d_2.append(d2)
wilc2 = scipy.stats.wilcoxon(d_1, d_2, zero_method='pratt')

print(wilc1)
print(wilc2)
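
The same pairwise filtering can also be written with a boolean mask instead of the loop (a quick sketch, assuming the lists are converted to numpy arrays first):

a_arr, b_arr = np.asarray(a), np.asarray(b)
keep = ~(np.isnan(a_arr) | np.isnan(b_arr))   # True where both values are present
wilc2_alt = scipy.stats.wilcoxon(a_arr[keep], b_arr[keep], zero_method='pratt')
print(wilc2_alt)                              # should match wilc2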

I am getting two warnings:

C:\Python27\lib\site-packages\scipy\stats\morestats.py:1963: RuntimeWarning: invalid value encountered in greater
  r_plus = sum((d > 0) * r, axis=0)


and two Wilcoxon outputs:

(54.0, 0.018545881687477818)
(54.0, 0.056806600853965265)


As you can see, the two test statistics (W) are identical, but the p-values differ. Which one is correct?

My guess is that Wilcoxon handles the missing values correctly when computing the test statistic, but uses the len() of the full data, not just the valid pairs, when computing the p-value. Could this be considered a bug?
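
A rough check seems consistent with this guess: assuming the usual normal approximation for the Wilcoxon p-value (and ignoring tie and continuity corrections), the same W = 54 reproduces both p-values above when n is 22 (nans left in) versus 20 (nan pairs dropped):

import numpy as np
from scipy.stats import norm

W = 54.0
for n in (22, 20):                                     # with / without the two nan pairs
    mu = n * (n + 1) / 4.0                             # mean of W under H0
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)  # std of W under H0
    print(n, 2 * norm.sf(abs(W - mu) / sigma))         # ~0.0185 for n=22, ~0.0568 for n=20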

1 answer


Mathematically, you cannot compute a test statistic on nan values. Unless you find evidence or documentation that nan gets special treatment, you cannot rely on it.

My experience is that, in general, even numpy does not handle nan specially, for example in the median. Instead, the result is simply whatever falls out of the implementation of the algorithm.

For example, numpy.median() seems to end up treating nan like inf, sorting nan above everything else. This is most likely a side effect of comparisons such as a < b always being False when either operand is nan. A similar effect is probably behind your two identical test statistics W: the nan differences never satisfy d > 0, so they contribute nothing to the rank sum.
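
To illustrate with a small sketch (not the actual scipy code path; on the versions I have tried, rankdata sorts nan last and any comparison with nan is False):

import numpy as np
from scipy.stats import rankdata

d = np.array([0.3, np.nan, -0.1, 0.2])   # differences, one pair "missing"
r = rankdata(np.abs(d))                  # nan gets the highest rank: [ 3.  4.  1.  2.]
print((d > 0) * r)                       # the nan comparison is False, so it contributes 0
print(np.sum((d > 0) * r))               # 5.0 -- the same r_plus as with the nan pair dropped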

Also note that numpy provides nan-aware variants of several functions, e.g. numpy.nanmean: http://docs.scipy.org/doc/numpy/reference/generated/numpy.nanmean.html
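
For example (assuming a reasonably recent numpy; nanmean has been available for a long time, nanmedian only appeared in later releases):

import numpy as np

a = np.array([0.75, np.nan, 0.58337])
print(np.mean(a))      # nan -- the plain reduction propagates the missing value
print(np.nanmean(a))   # 0.666685 -- the nan-aware variant simply ignores it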
