A quick(er) way of identifying all non-dominated elements in a list

I have a list of n arrays with 4 elements each, for example (n = 2):

l = [[1, 2, 3, 4], [5, 6, 7, 8]]

and I am trying to find all the elements of the list that are not dominated - that is, not dominated by any other element in the list. An array dominates another array if each of its elements is less than or equal to the corresponding element in the other array. So

dominates([1, 2, 3, 4], [5, 6, 7, 8]) == True

since 1 <= 5 and 2 <= 6 and 3 <= 7 and 4 <= 8. But

dominates([1, 2, 3, 9], [5, 6, 7, 8]) == False

since 9 > 8. This function is relatively easy to write, for example:

def dominates(a, b):
    return all(i <= j for i, j in zip(a, b))

More succinctly, given l = [a1, a2, a3, ..., an], where each ai is an array of length 4, I am looking to find all ai that are not dominated by any other aj in l.

I have the following solution:

def get_non_dominated(l):
    # Track indices rather than the (unhashable) lists themselves.
    to_remove = set()
    for ind, item_1 in enumerate(l):
        if ind in to_remove:
            continue
        for ind_2, item_2 in enumerate(l[ind + 1:], start=ind + 1):
            if dominates(item_2, item_1):
                to_remove.add(ind)
                break
            elif dominates(item_1, item_2):
                to_remove.add(ind_2)
    return [item for ind, item in enumerate(l) if ind not in to_remove]

So, get_non_dominated([[1, 2, 3, 4], [5, 6, 7, 8]]) should return [[1, 2, 3, 4]]. Similarly, get_non_dominated([[1, 2, 3, 9], [5, 6, 7, 8]]) should return the list unchanged (neither array dominates the other).
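
A quick sanity check of that behaviour (a minimal sketch, assuming the dominates and get_non_dominated definitions above):

assert get_non_dominated([[1, 2, 3, 4], [5, 6, 7, 8]]) == [[1, 2, 3, 4]]
assert get_non_dominated([[1, 2, 3, 9], [5, 6, 7, 8]]) == [[1, 2, 3, 9], [5, 6, 7, 8]]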

But this check happens a lot, and l can get quite large. I was wondering if anyone has any ideas for speeding this up? My first thought was to try to vectorize this code with numpy, but I have relatively little experience with it and am struggling a bit. You can assume that l contains only unique arrays. Any ideas are appreciated!

+3




2 answers


Another version of @Nyps' answer:

import numpy as np

def dominates(a, b):
    return (np.asarray(a) <= b).all()

This is a vectorized version of your code using NumPy.
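
For instance, on the examples from the question (a small check, assuming the definition above):

dominates([1, 2, 3, 4], [5, 6, 7, 8])  # -> True
dominates([1, 2, 3, 9], [5, 6, 7, 8])  # -> False, since 9 > 8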


It can still be slow if you need to go through all the rows you have. If you have a list of rows and want to compare them all pairwise, you can use scipy to create an N x N array (where N is the number of rows).

import numpy as np
a = np.random.randint(0, 10, size=(1000, 10))

Here a is a 1000 x 10 array, simulating 1000 rows of 10 elements each:

from scipy.spatial.distance import cdist
X = cdist(a, a, metric=dominates).astype(bool)

X is now a 1000 x 1000 matrix containing the pairwise comparison of all rows: X[i, j] is True if row i dominates row j, and False otherwise.



Now you can extract useful results from X, for example a row that dominates all the others:

>>> a[50] = 0 # set a row to all 0s to fake a dominating row
>>> X = cdist(a, a, metric=dominates).astype(bool)
>>> rulers = np.where(X.all(axis=1))[0]
>>> rulers
array([50])

The row at position 50 is the ruler of your population; you should keep a close eye on it.


Now, if you only want to keep the dominant rows, you can do:

if rulers.size > 0:
    return [a[i] for i in rulers]
else:  # no row dominates every other
    return a

As an example:

import numpy as np
from scipy.spatial.distance import cdist

def get_ruler(a):
    # Pairwise domination matrix: X[i, j] is True if a[i] dominates a[j].
    X = cdist(a, a, metric=dominates).astype(bool)
    rulers = np.where(X.all(axis=1))[0]
    if rulers.size > 0:
        return [a[i] for i in rulers]
    else:  # no row dominates every other
        return a
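
A minimal usage sketch, reusing the faked dominant row from above (the exact output depends on the random data, but row 50 is almost surely the only ruler):

a = np.random.randint(0, 10, size=(1000, 10))
a[50] = 0  # an all-zero row dominates every non-negative row
rulers = get_ruler(a)
print(len(rulers))  # 1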

+4




What about:

import numpy as np
np.all((np.asarray(l[1]) - np.asarray(l[0])) >= 0)



You can go a similar way if you create your list as a numpy array right away, i.e. type(l) == np.ndarray. Then the syntax would be:

np.all((l[1] - l[0]) >= 0)
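
If you want the whole non-dominated filter in one shot, here is a minimal broadcasting sketch along the same lines (the helper name get_non_dominated_np is mine, not from the question):

import numpy as np

def get_non_dominated_np(l):
    a = np.asarray(l)
    # dom[i, j] is True when row i dominates row j (every element <=).
    dom = (a[:, None, :] <= a[None, :, :]).all(axis=2)
    # Ignore self-domination on the diagonal.
    np.fill_diagonal(dom, False)
    # Keep the rows that no other row dominates.
    return a[~dom.any(axis=0)]

get_non_dominated_np([[1, 2, 3, 4], [5, 6, 7, 8]])  # -> array([[1, 2, 3, 4]])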

+1


source






