Removing outliers in each column (and corresponding row)

My NumPy array contains 10 columns and about 2 million rows.

Now I need to analyze each column separately, find the values that are outliers, and remove the whole row from the array.

So I would start by processing column 0; find outliers in, say, rows 10, 20 and 100; and remove those rows. Then I would move on to column 1 of the now-trimmed array and apply the same process.

Of course, I can think of the obvious manual way to do this (iterate over each column, find the indices that are outliers, delete those rows, move on to the next column; see the naive sketch below), but I've always found that NumPy has some quick nifty tricks for this kind of statistical task.
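
Something like this naive per-column loop is what I have in mind; the k-sigma criterion here is just a placeholder for whatever outlier rule ends up being used:

import numpy as np

def remove_outliers_naive(arr, k=3.0):
    # process one column at a time; the array shrinks after every column,
    # so each pass re-allocates the remaining rows
    for j in range(arr.shape[1]):
        col = arr[:, j]
        mu, sigma = col.mean(), col.std(ddof=1)
        arr = arr[np.abs(col - mu) < k * sigma]
    return arr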

And if you can say a little about the computational cost of the method, even better.

I am not limited to the NumPy library here; if SciPy has something useful, I have no problem using it.

Thanks!

+3




2 answers


Two very simple approaches, the second with a little more sophistication:

import numpy as np

arr = np.random.randn(int(2e6), 10)

def remove_outliers(arr, k):
    # keep only the rows where every column is within k standard deviations of its mean
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    return arr[np.all(np.abs((arr - mu) / sigma) < k, axis=1)]

def remove_outliers_bis(arr, k):
    # same criterion, but shrink the candidate set column by column, so that
    # later columns only test the rows that are still in the running
    mask = np.ones((arr.shape[0],), dtype=bool)
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    for j in range(arr.shape[1]):
        col = arr[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < k
    return arr[mask]


Performance depends on how many outliers you have:

In [38]: %timeit remove_outliers(arr, 1)
1 loops, best of 3: 1.13 s per loop

In [39]: %timeit remove_outliers_bis(arr, 1)
1 loops, best of 3: 983 ms per loop

In [40]: %timeit remove_outliers(arr, 2)
1 loops, best of 3: 1.21 s per loop

In [41]: %timeit remove_outliers_bis(arr, 2)
1 loops, best of 3: 1.51 s per loop




And of course:

In [42]: np.allclose(remove_outliers(arr, 1), remove_outliers_bis(arr, 1))
Out[42]: True

In [43]: np.allclose(remove_outliers(arr, 2), remove_outliers_bis(arr, 2))
Out[43]: True


I would say that the complication of the second method does not justify its potential speedup, but YMMV ...

+4




The best-performing approach depends on the relative costs of outlier detection and row removal, and on how frequent the outliers are.

If the outlier frequency is not very high, I would do the following (code sketch below):

  • create a boolean outlier mask (one element for each element in the original array)
  • sum the mask along axis 1 (i.e. per row)
  • build a new array containing only the rows whose outlier sum is 0

Deleting rows one at a time is time consuming, and unless outlier detection itself is very expensive, the extra work caused by possibly detecting several outliers in the same row is negligible.



As code, it will look something like this:

outliers = find_outliers(data)
data_without_outliers = data[outliers.sum(axis=1) == 0]


where find_outliers creates a boolean array of outlier status (i.e. True if the corresponding element of the original array data is an outlier).

I assume performance depends mostly on your outlier detection algorithm. If you can keep it simple and vectorized, it will be fast.
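
The answer does not spell out find_outliers; as one possible vectorized implementation (the k-sigma criterion and the default k=3 are my assumptions, not part of the answer), something like this would do:

import numpy as np

def find_outliers(data, k=3.0):
    # boolean mask, same shape as data: True where an element deviates from its
    # column mean by more than k sample standard deviations (assumed criterion)
    mu = data.mean(axis=0)
    sigma = data.std(axis=0, ddof=1)
    return np.abs(data - mu) > k * sigma

Any other rule that returns a boolean array of the same shape as data (a median/MAD rule, say) plugs into the same data[outliers.sum(axis=1) == 0] step.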

0