Fastest way to filter a list of lists based on another list?

I have a list A as shown below:

A = np.array([[1, 2],
              [2, 4],
              [3, 4],
              [4, 5],
              [6, 7]])

and I need to remove all sublists that contain any item from another list B.

So, if, for example:

B = [1, 2, 5]

Expected result:

np.array([[3, 4],
          [6, 7]])

      

The length of A reaches 1,500,000, and B often has tens of thousands of elements, so performance is critical. The length of the sublists in A is always 2.



2 answers


All approaches given here are based on numpy's boolean indexing. The idea is to identify the matches (regardless of the column) and then use a reduction ( np.any or np.all ) along the rows to see which rows should be removed and which should be kept. Finally, this mask is applied to your array A to get only the valid rows. The only real difference between the approaches is how you create the mask.

Approach 1:

If the values in B are known in advance, you can simply combine the comparisons with | (the or operator):

a[~np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)]

      

Step by step:

  • Find the matches:

    >>> ((a == 1) | (a == 2) | (a == 5))
    array([[ True,  True],
           [ True, False],
           [False, False],
           [False,  True],
           [False, False]], dtype=bool)
    
          

  • Check each row for at least one True :

    >>> np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)
    array([ True,  True, False,  True, False], dtype=bool)
    
          

  • Invert it:

    >>> ~np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)
    array([False, False,  True, False,  True], dtype=bool)
    
          

  • Use boolean indexing:

    >>> a[~np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)]
    array([[3, 4],
           [6, 7]])
    
          

Approach 2:

Instead of chaining comparisons like (a == 1) | (a == 2) | ... , you can also use np.in1d :

>>> np.in1d(a, [1, 2, 5]).reshape(a.shape)
array([[ True,  True],
       [ True, False],
       [False, False],
       [False,  True],
       [False, False]], dtype=bool)

      

and then use essentially the same approach as above:

>>> a[~np.any(np.in1d(a, [1, 2, 5]).reshape(a.shape), axis=1)]
array([[3, 4],
       [6, 7]])
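As a side note (not part of the original answer): newer NumPy versions (1.13+) also provide np.isin , which preserves the input's shape, so the reshape step becomes unnecessary. A minimal sketch:

```python
import numpy as np

a = np.array([[1, 2], [2, 4], [3, 4], [4, 5], [6, 7]])

# np.isin (NumPy >= 1.13) keeps a's shape, so no reshape is needed
result = a[~np.isin(a, [1, 2, 5]).any(axis=1)]
print(result)
# [[3 4]
#  [6 7]]
```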

      

Approach 3:

If B is sorted, you can also use np.searchsorted to create the mask:

>>> np.searchsorted([1, 2, 5], a, side='left') == np.searchsorted([1, 2, 5], a, side='right')
array([[False, False],
       [False,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]], dtype=bool)

      

This time you need to check whether all values in each row are True :

>>> b = [1, 2, 5]
>>> a[np.all(np.searchsorted(b, a, side='left') == np.searchsorted(b, a, side='right'), axis=1)]
array([[3, 4],
       [6, 7]])
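To see why comparing the two sides works (an illustrative aside, not from the original answer): for a value that is present in the sorted B , the 'left' and 'right' insertion points enclose its occurrence and therefore differ; for an absent value they coincide:

```python
import numpy as np

b = [1, 2, 5]

# 2 is in b: the two insertion points differ (1 vs 2)
print(np.searchsorted(b, 2, side='left'), np.searchsorted(b, 2, side='right'))  # 1 2
# 3 is not in b: both sides give the same insertion point
print(np.searchsorted(b, 3, side='left'), np.searchsorted(b, 3, side='right'))  # 2 2
```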

      

Timings:



The first approach is not suitable for arbitrary B , so I do not include it in these timings.

import numpy as np

def setapproach(A, B):  # author: Max Chrétien
    B = set(B)
    indices_to_del = [i for i, sublist in enumerate(A) if B & set(sublist)]
    C = np.delete(A, indices_to_del, 0)
    return C

def setapproach2(A, B):  # author: Max Chrétien & Ev. Kounis
    B = set(B)
    return np.array([sublist for sublist in A if not B & set(sublist)])

def isinapproach(a, b):
    return a[~np.any(np.in1d(a, b).reshape(a.shape), axis=1)]

def searchsortedapproach(a, b):
    b.sort()  # searchsorted requires b to be sorted (sorts in place)
    return a[np.all(np.searchsorted(b, a, side='left') == np.searchsorted(b, a, side='right'), axis=1)]

A = np.random.randint(0, 10000, (100000, 2))
B = np.random.randint(0, 10000, 2000)

%timeit setapproach(A, B)
# 929 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit setapproach2(A, B)
# 1.04 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit isinapproach(A, B)
# 59.1 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit searchsortedapproach(A, B)
# 56.1 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

      

The timings will of course depend on the range of the values, on whether B is already sorted, and on the lengths of A and B . Still, the numpy approaches seem nearly 20x faster than the set-based solutions. However, a big part of the difference is that iterating over numpy arrays with Python loops is really inefficient, so let me first convert A to a plain list with tolist :

def setapproach_updated(A, B):
    B = set(B)
    indices_to_del = [i for i, sublist in enumerate(A.tolist()) if B & set(sublist)]
    C = np.delete(A, indices_to_del, 0)
    return C

def setapproach2_updated(A, B):
    B = set(B)
    return np.array([sublist for sublist in A.tolist() if not B & set(sublist)])

      

It may sound strange, but compare the timings:

%timeit setapproach_updated(A, B)
# 300 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit setapproach2_updated(A, B)
# 378 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

      

That's much faster than the plain loops, just from the tolist conversion, but still 5+ times slower than the numpy approaches.

So remember: whenever you need to use Python-based approaches on NumPy arrays, check whether it is faster to convert the array to a list first!
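A quick way to see where that loop overhead comes from (an illustrative aside, not from the original answer): iterating over a 2-D numpy array yields numpy sub-arrays, whose elements are numpy scalars, while iterating over the tolist result yields plain Python lists of plain ints, which are much cheaper to touch in a loop:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

row_np = next(iter(A))           # each iteration creates a numpy array view
row_py = next(iter(A.tolist()))  # a plain Python list of Python ints

print(type(row_np))   # <class 'numpy.ndarray'>
print(type(row_py))   # <class 'list'>
print(type(row_py[0]))  # <class 'int'>
```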

Let's see how this works on large arrays (these are the sizes that roughly correspond to those indicated in your question):

A = np.random.randint(0, 10000000, (1500000, 2))
B = np.random.randint(0, 10000000, 50000)

%timeit setapproach_updated(A, B)
# 4.14 s ± 66.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit setapproach2_updated(A, B)
# 6.33 s ± 95.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit isinapproach(A, B)
# 2.39 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit searchsortedapproach(A, B)
# 1.34 s ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

      

The differences have narrowed, and the searchsorted approach is the clear winner.

Approach 4:

I'm not done yet! Let me introduce numba. It is not a lightweight package, but it is extremely fast if it supports the types and functions you need:

import numba as nb

@nb.njit                # the magic is this decorator
def numba_approach(A, B):
    Bset = set(B)
    mask = np.ones(A.shape[0], dtype=nb.bool_)
    for idx in range(A.shape[0]):
        for item in A[idx]:
            if item in Bset:
                mask[idx] = False
                break
    return A[mask]

      

Let's see how it works:

A = np.random.randint(0, 10000, (100000, 2))
B = np.random.randint(0, 10000, 2000)

numba_approach(A, B)   # numba needs a warmup run because it compiles just-in-time

%timeit numba_approach(A, B)
# 6.12 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# This is 10 times faster than the fastest other approach!

A = np.random.randint(0, 10000000, (1500000, 2))
B = np.random.randint(0, 10000000, 50000)

%timeit numba_approach(A, B)
# 286 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This is still 4 times faster than the fastest other approach!

      

So you can gain another order of magnitude. Numba doesn't support all Python/NumPy features (and not all of them are faster), but it is enough in this case!



Use set intersection to build the list of indices whose sublists contain any of [1, 2, 5] . Then, with that list of indices to remove, use numpy's built-in np.delete() function.

import numpy as np

A = np.array([[1,2],
              [2,4],
              [3,4],
              [4,5],
              [6,7]])

B = set([1, 2, 5])

indices_to_del = [i for i, sublist in enumerate(A) if B & set(sublist)]

C = np.delete(A, indices_to_del, 0)

print(C)
#[[3 4]
# [6 7]]

      

EDIT



Thanks to @MSeifert, I was able to improve my answer.

@Ev.Kounis suggested another similar, but faster solution:

D = np.array([sublist for sublist in A if not B & set(sublist)])

      
